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Abstract  of  the  Dissertation 

The  Bayesian  Approach  to  Recursive  State  Estimation: 
Implementation  and  Application 

by 

Stuart  Charles  Kramer 

Doctor  of  Philosophy  in  Engineering  Sciences  (Systems  Science) 
University  of  California,  San  Diego,  1985 
Professor  Harold  W.  Sorenson,  Chairperson 


In  Bayesian  estimation,  the  objective  is  to  calculate  the  complete  density  func¬ 
tion  for  an  unknown  quantity  conditioned  on  noisy  observations  of  that  quantity.  This 
work  considers  recursive  estimation  of  a  nonlinear  discrete-time  system  state  using  suc¬ 
cessive  observations.  The  formal  recursion  for  the  density  function  is  easily  written,  but 
generally  there  is  no  closed  form  solution.  The  numerical  solution  proposed  here  is 
obtained  by  modifying  the  recursion  and  using  a  simple  piece-wise  constant  approxima^ 
tion  to  the  density  functions.  The  critical  part  of  the  algorithm  then  becomes  a  discrete 
linear  convolution  that  can  be  realized  using  FFT’s.^  Hence  this  approach  requires  only 
0(N/n(N))  (where  N  is  the  number  of  grid  points  in  the  approximation)  operations  for 
the  convolution  instead  of  the  O(N^)  of  previous  solutions.  The  approach  also  allows 
detailed  analysis  of  error  propagation  through  the  algorithm.  This  allows  characteriza¬ 
tion  of  the  situations  leading  to  potentially  large  errors  and  derivation  of  the  bound  on 
the  maximum  error  growth.  The  algorithm  is  shown  to  be  quite  stable  by  comparing  its 
long-term  performance  to  a  Kalman  filter  for  a  linear  system  with  Gaussian  noises. 
Two  potentially  broad  uses  for  the  technique  are  then  explored.  First,  the  technique 
can  be  used  to  gain  insight  into  stochastic  system  behavior  by  visualizing  the  propaga¬ 
tion  of  the  density  through  time.  Second,  it  provides  a  benchmark  for  traditional  point 
estimators,  since  the  true  optimal  estimate  for  any  given  loss  function  can  be  calculated 
from  the  conditional  density.  To  illustrate  these  uses,  we  consider  two  nonlinear  appli¬ 
cations:  simultaneous  state/parameter  estimation  and  bearings-only  tracking.  In  both 
cases  the  density  behavior  is  analyzed,  and  then  displayed  graphically.  The  mean  of  the 
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1.  INTRODUCTION 


1.1.  The  Bayesian  Approach  to  Recursive  State  Estimation 

Tlic  Bayesian  approacii  to  slate  esi iiiiat ioti  is  l)ase(i  in  tiie  Bayesian  philosophy 
of  slatisiies.  wliich  is  a  subject  unto  itself.  For  ihe  sake  of  brevity,  we  will  only  briefly 
discuss  some  facets  of  the  Bayesian  approach  which  have  direct  bearing  on  this  work. 
For  more  detailed  discussions  of  and  justifications  for  the  approach,  the  reader  must 
refer  to  the  considerable  literature  on  the  subject.  References  ll]-l3]  and  |35j-j37)  are  a 
sampling  of  applicable  works.  The  interested  reader  will  undoubtedly  find  many  addi¬ 
tional  useful  references  with  minimal  effort. 

f^or  our  purposes  it  is  enough  to  note  that  there  are  two  distinguishing  features 
of  the  Bayesian  approach.  The  first,  and  less  controversial,  is  the  recognition  that  the 
ultimate  use  of  estimation  (or  more  correctly,  statistical  inference)  is  to  provide  a 
rational  basis  for  decision  making  under  uncertainty,  and  that  this  function  is  the  first 
stage  of  the  decision  process.  The  second  stage  is  picking  a  policy  to  maximize  an 
expected  value  or  minimize  an  expected  loss.  Bayesian  statistics  provides  the  frame¬ 
work  for  the  calculation,  manipulation,  and  interpretation  of  probability  densities  as  a 
basis  for  decision  making.  Decision  theory  provides  the  utility  of  loss  function,  calcu¬ 
lates  expectations,  and  maximizes  or  minimizes  in  the  process  of  making  a  decision.  As 
an  example,  consider  the  optimal  stochastic  control  problem.  The  decision  is  what  con¬ 
trol  policy  to  adopt  in  order  to  minimize  the  expected  value  of  some  cost  functional. 
Calculating  tlie  density  function  to  be  used  in  taking  the  expectation  is  independent  of 
the  choice  of  cost  functional.  'Fhe  Bayesian  viewpoint  merely  makes  this  separation 
explicit. 

The  second  feature  of  this  approach  amounts  to  the  replacement  of  the  word 
‘random’  with  the  mori'  general  term  ‘uncertain’  in  our  probabilistic  discussions.  It  does 
not  matter  wlu'ther  the  uncertainty  ari.ses  from  a  truly  random  process  or  from  a  deter¬ 
ministic  process  that  is  sufficiently  complicated  or  poorly  understood  that  we  are  uncer¬ 
tain  of  its  slate.  All  typi's  of  uncertainty  eipiate  to  randomness.  Broadening  our  defini¬ 
tion  of  random  in  this  way  is  not  as  far-fetche<l  as  it  may  seem;  even  classical  statisti¬ 
cians  do  it  at  times.  As  an  example,  consider  the  over-used  coin  flip.  It  is  not  really  a 
random  event  in  the  classical  sen.se.  That  is,  if  I  could  in  fact  exactly  duplicate  the 
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condilions  of  a  toss,  all  the  applied  forces  and  the  timing  of  the  catch  and  all  the  other 
factors,  then  1  would  get  the  same  result  each  time.  Taken  the  other  way,  if  I  had  suffi¬ 
cient  knowledge  of  the  physical  factors  affecting  the  coin,  I  could  predict  the  outcome 
with  complete  certainty.  We  are  only  willing  to  accept  it  as  random  because  this  is 
such  a  complicated  system  (person  and  coin  together)  that  we  generally  have  insuffi¬ 
cient  knowledge  to  predict  its  outcome.  As  another  example  closer  to  an  engineer’s 
heart,  look  at  computer  generated  pseudo-random  numbers.  For  most  practical  pur¬ 
poses  we  accept  these  as  random,  when  in  fact  they  are  generated  by  a  (very  compli¬ 
cated)  purely  deterministic  procedure.  A  more  extreme  example  is  the  problem  of 
parameter  identification.  For  a  given  system,  the  unknown  parameter  has  a  definite 
value.  If  we  could  measure  it,  it  would  become  a  known  constant.  It  is  clearly  not  a 
random  variable  in  the  classical  sense,  yet  the  standard  engineering  practice  is  to  treat 
it  as  such.  If  anything,  this  broader  interpretation  of  random  is  probably  closer  to  most 
engineers  intuitive  understanding  than  the  classical  definition. 

Complementing  the  broadening  of  the  definition  of  random  is  a  reinterpretation 
of  the  probability  density  associated  with  an  unknown  quantity.  In  the  Bayesian  view, 
the  density  represents  the  observer’s  judgement  of  the  relative  likelihoods  of  the  possible 
outcomes.  Put  another  way,  the  density  function  is  a  measure  of  the  observer’s 
knowledge  about  the  unknown  quantity.  This  interpretation  raises  several  points. 
First,  the  density  function  is  a  feature  of  the  system/observer  pair,  and  is  not  an  intrin¬ 
sic  feature  of  the  system.  Furthermore,  it  may  be  legitimately  different  for  different 
observers  of  the  same  system,  since  different  observers  may  have  differing  amounts  of 
information  about  the  system.  Second,  the  observer  must  be  able  (willing?)  to  quantify 
his  knowledge  in  the  form  of  a  density  function.  While  we  will  take  the  position  that 
this  is  a  given,  we  point  out  that  it  is  not  a  trivial  problem.  Third,  this  interpretation 
need  nut  be  at  odds  with  the  classical  relative  frequency  interpretation.  Faced  with  a 
system  for  which  we  have  historical  information,  it  is  reasonable  to  assume  (without 
information  to  the  contrary)  that  the  same  relative  frequencies  of  outcomes  will  con¬ 
tinue.  Lastly,  despite  the  broadening  of  our  interpretation,  all  the  familiar  rules  for 
manipulating  densities  still  apply. 

Now  look  at  these  two  ideas  in  the  context  of  an  estimation  problem.  Since  we 
want  eventually  to  use  our  information  in  a  decision  process,  and  since  the  density 
describes  our  information,  we  must  need  to  calculate  the  entire  density  function  for  the 
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unknown  quantity.  This  will  describe  the  probability  (likelihood)  of  a  particular  out¬ 
come  conditioned  on  our  knowledge.  If  we  are  fortunate,  we  may  have  subsequent 
observations  of  the  sy  em.  The  additional  information  can  be  incorporated  into  our 
density  using  standard  probability  theory,  resulting  in  a  new  density  conditioned  on  our 
increased  knowledge.  Again,  the  entire  density  is  required  to  describe  the  information 
we  have  about  the  unknown  quantity. 

The  main  point  is  that  the  Bayesian  objective  is  to  calculate  the  entire  proba¬ 
bility  density  function  of  the  unknown  quantity,  based  on  the  available  information. 
This  work  focuses  on  the  discrete-time  state  estimation  problem,  where  we  want  infor¬ 
mation  about  the  state  of  a  system  given  imperfect  observations  of  the  system.  More¬ 
over,  we  want  to  do  so  recursively,  so  that  we  update  our  information  step  by  step  as 
we  get  new  observations.  We  will  see  shortly  that  this  problem  naturally  yields  a  recur¬ 
sive  solution. 

1.2.  The  General  Solution 

In  the  present  case,  the  unknown  quantity  is  the  state  of  a  discrete-time  system 
which  propagates  through  time  according  to 

+  i  =  /*(**>«*.«'*) 

where  R"  is  the  system  state  at  time  1^0,  R”  is  the  known  control  applied  to  the 
system,  u'jfe  R'  is  an  unknown  disturbance,  and  /*:R"xR'"xR'-«  R"  is  the  system  transi¬ 
tion  function.  The  only  assumption  on  the  times  represented  by  k  is  that  they  are 
sequential;  in  particular,  it  is  not  necessary  that  they  are  equally  spaced  in  time.  Note 
that  the  system  is  memoryless,  in  the  sense  that  the  the  state  at  time  l-t  1  depends  only 
on  the  state  and  inputs  at  time  k,  and  not  on  the  past  history  of  the  state  or  inputs. 
The  information  available  to  us  is  contained  in  a  set  of  measurements,  augmented  by 
the  known  inputs.  The  measurements  zjt  R”  are  given  by 

where  R**  is  an  unknown  disturbance,  and  /it:R''-  R''  •  R“  is  the  measurement  function. 
For  convenience  define  the  information  vector 

i 

The  vector  represents  all  the  information  we  have  gathered  about  the  system  during 
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its  operalioii. 

Suppose  that  we  w'ish  to  solve  tlu-  f)reclictioii  problem.  In  other  w'ords,  w'e  w'ant 
to  know  at  each  time  k  the  stale  of  the  system  at  the  next  time,  k-r\.  In  view  of  the 
unknown  disturbances  iii  the  system  and  the  observations,  it  is  clear  that  we  cannot 
know  this  precisely.  We  can  however  make  an  assessirient  of  the  likelihood  of  the  sys¬ 
tem  state  taking  on  any  particular  value  based  on  the  information  available  to  us.  In 
the  Bayesian  view,  this  means  we  want  to  calculate  the  conditiotial  prediction  density 
^k)-  Applying  elementary  rules  for  manipulating  densities,  we  can  write 


^k)  ^  f 


P(^k\  ^k  l.^t)  P(^t,l!  IkJk)p{*‘k\  Ik,2k<^k-\) 
pI"*!  2k,^k-i) 


dzk 


If  we  define  the  lime  update  density  t^  as 


P(*ttil  ^k<Zk)p(»k\  ^k<2k<Zk-\) 


then  the  above  becomes 


P{lt.ll  2*)  J  P(**l  di*  (1.1) 

The  function  represents  the  likelihood  of  the  system  state  transitioning  from  a  given 
state  at  time  k  to  another  at  ib+1,  and  is  calculable  from  the  model  description  and  the 
information  vector.  The  density  p(i*|  is  the  conditional  density  for  the  state  at 

time  it  given  observations  through  time  it,  and  is  known  as  the  filtering  density.  We  see 
from  (1.1)  that  the  conditional  prediction  density  is  the  nonlinear  convolution  of  the 
current  filtering  density  and  the  time  update  density. 

Let  us  look  now  at  the  filtering  density.  Applying  Bayes’  theorem,  we  get 


p{ik 


Zk  U^k)  = 


P(jtl  ^k  l)  P(^k  \  Zk,^k-i) 
P(^*l  ^t-i) 


P(^J  ^k  t)  P(^tl  Ik  ^k  i) 

f  plzkl  A-i)  p{^kl  Zk.^k  i)  d^k 

l)t  fine  the  measurement  update  density  as 


so  that  the  above  becomes 


P(**l  ^  1,2*) 


pjzkl  Zj  ,} 

f  P(2*l  l)  "»t(2*.2*)  dZt 


(1.2) 


I'Jie  riM;asuremenl  updalt-  density  describes  the  likelihood  of  gelling  ihe  parlicular  obser- 
valion  from  different  possible  slale  values,  and  is  calculable  from  ihe  model  descrip- 
lioti  and  ihe  dala.  Equalion  (1.2)  shows  lhal  ihe  fillering  density  is  oblained  by  multi¬ 
plying  ihe  Iasi  prediction  density  by  the  measurement  density  and  then  normalizing. 

We  cati  continue  to  apply  (1.1)  and  (1.2)  alternately  until  we  reach  k  At 
that  point,  we  must  assume  an  initial  detisity  pUu)  p(xo-^  i)i  prediction  density 
for  the  slale  at  time  zero  before  any  observat  ions  are  made.  The  three  elements  that  we 
need  to  compute  p(i*,il  Z^)  then  are  p(ro)i  •’'<1  ‘‘•‘d  t,  for  i  =  i,k.  These  densities  are 
ktiown  as  the  priors  for  the  problem  and  summarize  our  a  priori  knowledge  about  the 
behavior  and  structure  of  the  system.  Although  the  lime  and  measurement  update  den¬ 
sities  may  in  general  depend  parametrically  on  the  particular  measurement  realization, 
they  are  structurally  determined  by  the  system  model. 

Viewed  from  the  opposite  direction  in  time  (from  i  =  0  forward),  equations  (1.1) 
and  (1.2)  form  a  natural  recursion.  Given  the  priors,  we  alternately  calculate  the  condi¬ 
tional  prediction  density  and  conditional  Tiltering  density  as  we  accumulate  observa¬ 
tions.  These  two  equations  represent  the  complete  formal  solution  to  the  recursive 
Bayesian  state  estimation  problem.  Although  the  solution  is  easy  to  obtain  in  this  gen¬ 
eral  form,  its  practical  solution  is  not  so  simple.  As  in  many  cases  in  engineering, 
approximate  methods  of  solution  must  be  devised. 

1.3.  Structure  of  this  Dissertation 

This  dissertation  is  basically  divided  into  two  parts.  First,  we  develop  a  new 
approach  to  the  implementation  of  equations  (1.1)  and  (1.2).  Then,  we  use  that 
approach  as  a  tool  in  exploring  sonre  estimation  problems. 

Beginning  the  first  part,  chapter  2  discusses  previous  work  in  developing 
approximate  ways  of  itnpicmentirig  the  recursion,  and  relates  this  work  to  those  efforts. 
Ba.si(  ally  what  has  been  done  is  to  approximate  the  densities  in  various  ways,  so  that 
I  lie  convolution  in  (1.1)  and  the  integration  in  (1.2)  can  be  done  numerically.  The 
ap|iroacli  here  is  to  use  a  simple  approximation  (piecewise  constant)  that  results  in  a 
cum[>ut ationally  efficient  algorithm.  The  algorithm  itself  is  presented  in  Chapter  3. 
First  the  basic  equations  are  derived  for  a  somewhat  more  restrictive  model  than  the 
one  given  in  the  previous  section.  Next,  implementation  of  these  equations  is  discussed, 
sitice  the  computational  advantage  of  this  method  obviously  depends  on  the 
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iiiipliMiitTitiii ion  (lelails.  Fitiallv,  :>in<'«-  liiis  is  an  approximate  melhod  and  there  is  some 
error  assta  iated  wiin  it,  an  analysis  of  the  error  and  its  pr(>pagalion  is  presented. 

Chapter  4  rovers  some  simple  examples  of  the  application  of  the  algorithm  to 
demonstrat*'  its  performance.  A  .scalar  linear  ssstem  with  (iaussian  noise  is  considered 
first.  'I’his  allows  the  approximation  t<i  be  compared  against  the  exact  solution,  since 
the  exalt  solution  for  this  case  is  given  by  the  Kalman  filter.  'I’his  simple  example  is 
studied  (juite  thoroughly  to  demonstrate  the  performance  of  this  techni<|ue.  Finally,  a 
simple,  nonlinear,  non-Gaussian  case  is  considered  to  further  demonstrate  the  method. 

I  sing  this  algorithtn  as  a  tool,  we  then  look  at  some  broader  issues  associated 
witli  the  Bayesian  approach.  Kven  with  the  computational  improvement  of  this 
method,  the  Bayes  solution  is  fairly  expensive.  The  logical  question  is  what  do  we  gain 
by  using  this  approach.  The  answer  to  that,  of  course,  is  highly  problem  dependent. 
VN  e  can  descrilie,  however,  several  general  ways  that  the  algorithm  can  be  of  use  to  us. 
First,  simply  working  with  the  equations  and  thinking  of  the  process  in  terms  of  opera¬ 
tions  on  densities  provides  a  deeper  understanding  of  the  mechanics  of  the  process.  For 
more  detail,  we  can  use  the  close  approximation  of  the  posterior  density  that  the  algo¬ 
rithm  generates  to  plot  the  prediction  density  or  develop  other  useful  descriptions  of  it. 
This  can  provide  additional  insight  into  the  behavior  of  the  system,  and  is  useful  in 
much  the  same  way  as  traditional  simulation  is  for  deterministic  systems.  Furthermore, 
as  is  done  with  deterministic  simulation,  we  can  use  the  technique  to  examine  how  the 
behavior  of  the  system  varies  as  we  change  features  of  the  system.  Finally,  again  since 
we  have  a  representation  of  the  posterior  density,  we  can  actually  calculate  the  optimal 
estimator  for  any  given  loss  function.  Thus  we  can  compare  the  performance  of  tradi¬ 
tional  point  estimators  directly  to  that  of  the  optimal.  In  this  way  we  can  judge,  for 
instance,  whether  apparently  poor  performance  is  a  fault  in  the  estimator,  or  is  actually 
just  the  best  that  one  can  expect.  This  could  be  particularly  useful  in  assessing  tran¬ 
sient  [)erformanc»‘.  where  there  are  virtually  no  theoretical  results.  In  CMiapter  5  we  con¬ 
sider  these  uses  in  a  concrete  situation,  in  the  context  of  the  problem  of  simultaneous 
state  and  parameter  estimation  for  a  linear  system.  In  Chapter  6  we  continue  by  look¬ 
ing  in  some  detail  at  a  passive  tracking  problem,  where  the  measurements  consist  only 
ol  b«-arings  to  a  target.  Lastly,  Chapter  7  contains  a  summary  of  the  major  conclusions 
of  this  work. 


2.  REVIEW  OF  PREVIOUS  WORK  AND  RELATION  TO  THIS  EFFORT 

Equations  (1.1)  and  (1.2)  represent  the  most  general  solution  to  the  recursive 
Bayesian  state  estimation  problem.  The  difficulty  is  that  there  is  generally  not  a  closed 
form  solution  to  this  recursion.  The  notable  exception  is  when  the  system  is  described 
by  the  linear  equations 

1*^1  =  /jZ*  i 

Zk  =  +  l>* 

where  u;*  and  t*  are  white  Gaussian  sequences  uncorrelated  with  the  state.  If  the  initial 
density  is  Gaussian,  the  conditional  densities  remain  Gaussian,  and  the  mean  and  vari¬ 
ance  of  these  densities  propagate  according  to  the  well-known  Kalman  filter  equations. 
Since  a  Gaussian  is  completely  described  by  its  mean  and  variance,  the  Kalman  filter  is 
in  fact  the  Bayes  solution  for  this  model.  This  fact  apparently  was  first  noted  by  Ho 
and  Lee  |33j. 

Ideally,  we  would  like  the  densities  to  maintain  the  same  functional  form,  as  in 
the  linear  Gaussian  case,  so  that  the  Bayesian  recursion  could  be  accomplished  by  alge¬ 
braically  updating  a  finite  number  of  parameters.  Densities  with  this  property  are 
known  as  reproducing  densities.  Spragins  |34)  has  shown  for  the  special  case  of  recur¬ 
sively  estimating  a  fixed  parameter  that  a  reproducing  density  exists  if  and  only  if  there 
exists  a  sufficient  statistic  for  the  data  which  is  expressible  as  a  vector  of  fixed  length. 
This  is  not  true  in  the  general  case,  however,  because  of  the  presence  of  the  time  update 
due  to  the  system  dynamics  and  input  noise.  Reproducing  densities  have  been  shown  to 
exist  for  a  few  particular  combinations  of  system  dynamic  functions  and  noise  densities 
[4,5],  but  they  are  not  generally  useful.  It  is  difficult  to  find  reproducing  densities  for 
this  recursion  because  of  the  two  very  different  operations  involved:  multiplication  for 
the  measurement  update,  and  non-linear  convolution  for  the  time  update.  Closure  with 
respect  to  both  operations  is  rare. 

When  the  model  is  not  linear  Gaussian,  we  must  look  for  approximate  means  of 
evaluating  the  recursion  (l.l)-(1.2).  The  earliest,  and  certainly  the  most  used  approxi¬ 
mation,  is  the  extended  Kalman  filter  or  EKF.  In  the  EKF,  the  system  model  is  linear¬ 
ized  about  the  mean  of  the  leist  prediction  density  and  the  Kalman  filter  equations 
applied  directly  to  the  linearized  system.  In  the  Bayesian  context,  this  amounts  to 
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approximating  the  measurement  and  time  update  densities  by  Gaussians.  This  in  turn 
implies  that  the  conditional  densities  are  approximated  by  Gaussians.  Glearly,  this  is  a 
crude  approximation,  highly  dependent  on  the  quality  of  the  linearization  and  the 
actual  form  of  the  noise  densities.  In  practice,  however,  this  approximation  has  been 
shown  to  be  reasonably  good  in  a  wide  variety  of  cases.  (For  a  number  of  good  exam¬ 
ples  of  applications  of  the  EKF,  see  [32].)  Its  principal  advantage  is  that  it  is  by  far  the 
computationally  simplest  of  all  the  approximation  techniques.  On  the  other  hand,  the 
EKF  is  subject  to  divergence,  where  the  actual  error  in  the  estimate  exceeds  the  error 
covariance  approximation  provided  by  the  filter.  The  EKF  also  has  poor  transient 
response  in  many  cases,  even  though  the  steady  state  performance  may  be  acceptable. 
The  difficulties  arise  since  the  relatively  smooth  approximating  Gaussian  may  miss 
important  features  of  the  true  densities.  Considerable  work  has  been  done  on  improving 
the  EKF  by  various  means,  but  we  will  not  discuss  those  modifications  here.  Since  vir¬ 
tually  all  of  these  techniques  are  derived  as  point  estimators,  and  not  as  true  approxi¬ 
mations  to  the  conditional  densities,  they  are  not  Bayes  estimators. 

Series  expansions  such  as  Gram-Charlier  and  Edgeworth  expansions  were  the 
next  attempt  to  improve  the  fidelity  of  the  density  approximations  |6-n].  In  this  tech¬ 
nique,  the  densities  are  approximated  by  an  infinite  sum  of  polynomials  which  are 
orthogonal  with  respect  to  the  Gaussian  density  and  can  be  used  to  represent  a  large 
class  of  other  densities.  The  convolution  of  (1.1)  can  then  be  represented  as  the  sum  of 
convolutions  of  the  simpler  polynomials.  In  theory,  this  can  provide  arbitrarily  good 
approximation  to  the  actual  densities.  In  practice,  however,  it  was  found  that  a  large 
number  of  terms  are  needed  to  get  reasonable  accuracy  for  distinctly  non-Gaussian  den¬ 
sities.  In  addition,  when  the  series  are  truncated,  as  they  must  be,  the  resulting  approx¬ 
imation  can  be  negative  over  parts  of  the  state  space.  Hence  the  approximation  may 
not  be  a  density  itself.  This  can  cause  significant  disturbances  in  subsequent  approxi¬ 
mate  densities,  as  well  as  in  any  expectations  taken  in  the  decision  process. 

Both  these  previous  techniques  are  essentially  local  techniques,  since  they  are 
designed  to  be  most  accurate  in  a  restricted  central  region.  The  remaining  techniques 
we  will  discuss  are  global,  in  that  they  attempt  to  provide  a  uniform  degree  of  approxi¬ 
mation  over  the  entire  densities.  (This  distinction  is  due  to  Sorenson  |3l].)  There  are 
basically  two  global  approaches:  direct  approximation  of  the  densities,  and  approxima¬ 
tion  of  the  integrals  in  (1.1)  and  (1.2).  Both  approaches  are  based  on  sampling  the 
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densities  at  points  distributed  through  the  region  containing  non-negligible  probability. 
In  function  approximation,  the  points  and  the  approximate  density  values  at  those 
points  are  chosen  on  the  basis  of  matching  some  interpolating  function  to  the  density. 
The  integration  (or  functional)  approximation  approach  picks  the  points  based  on  a 
particular  numerical  integration  scheme.  Of  course,  the  two  a|>pruaches  are  usually  dual 
in  the  sense  that  a  particular  interpolation  method  results  in  the  integration  being 
equivalent  to  some  numerical  integration  approximation,  and  similarly,  a  numerical 
integration  method  implies  a  particular  interpolation  scheme.  Nevertheless,  this  distinc¬ 
tion  is  useful  for  classifying  the  approximate  Bayes  methods. 

The  simplest  of  the  global  function  approximation  methods  is  the  point-mass 
method  introduced  by  Bucy  [12]  and  elaborated  by  Bucy  and  Senne  [13|.  In  this 
method,  the  densities  are  approximated  by  point  masses  located  on  a  regular  grid.  To 
keep  the  grid  compatible  with  the  evolving  density  while  reducing  storage  requirements, 
Bucy  and  Senne  proposed  a  floating  rotating  grid.  The  grid  at  each  time  is  centered  on 
the  mean  of  the  prediction  density  and  based  on  the  eigenvectors  of  the  covariance 
matrix.  Bucy  and  Senne  also  reduced  storage  by  retaining  only  those  points  in  an  ellip)- 
soid  based  on  the  covariance  matrix.  The  size  of  the  ellipsoid  is  chosen  so  that  only  a 
fixed  fraction  of  the  total  probability  mass  is  discarded.  The  final  refinement  was  the 
use  a  separate  grid  for  each  mode  if  the  densities  were  multimodal.  Given  an  a  priori 
density  approximated  in  this  way,  the  filtering  density  resulting  from  (1.2)  is  also  a  col¬ 
lection  of  point  masses.  The  convolution  of  (l.l)  can  then  be  written  as  a  summation. 
This  summation  is  then  evaluated  at  the  new  grid  points  to  provide  the  approximation 
to  the  new  prediction  density. 

An  alternative  approach  was  introduced  at  about  the  same  time  by  Alspach 
and  Sorenson  [14-17|.  In  this  method,  the  densities  are  approximated  by  a  weighted 
sum  of  Gaussians,  each  centered  on  a  different  point  in  state  space.  The  recursion  can 
then  be  written  as  a  collection  of  EKF’s  operating  in  parallel.  Since  each  Gaussian  in 
the  sum  has  comparatively  small  variance,  it  is  more  likely  that  the  linearization  in  the 
KKF  will  remain  a  valid  approximation.  The  choice  of  grid  point  locations,  variances, 
and  weights  is  best  done  by  minimizing  the  Lj  norm  of  the  error,  although  approximate 
methods  can  be  used  to  reduce  computation. 

Both  these  methods,  particularly  the  point-mass  method,  have  considerable 
computational  and  storage  requirements,  although  they  can  provide  reasonably  accurate 
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density  approximations.  These  requirements  were  considered  excessive  in  light  of  the 
computational  resources  available  at  the  time  they  were  developed.  Both  speed  and 
storage  were  quite  limited  compared  to  that  available  today.  Of  the  two  limitations, 
storage  was  actually  the  more  restrictive.  This  tended  to  make  it  more  profitable  to 
trade  off  storage  for  computational  complexity.  For  a  given  desired  accuracy,  it  was 
clear  that  you  could  get  by  with  fewer  grid  points  if  you  used  a  more  sophisticated 
interpolation  scheme.  A  theoretical  framework  for  this  work  was  provided  by  Center 
1 18]  who  considered  the  problem  in  terms  of  generalized  least  squares.  His  approach 
allowed  for  the  development  of  an  essentially  unlimited  number  of  different  approxima¬ 
tions,  of  which  the  point-mass  and  Gaussian  sum  approaches  are  examples.  Most  of  the 
remaining  work  in  this  direction  focused  on  different  spline  approximations  [19-22]. 

The  most  recent  of  the  function  approximations  is  the  p-vector  approach  of 
Sorenson  [23].  This  method  is  derived  from  the  Gaussian  sum  method  by  taking  the 
limit  as  the  variances  of  the  individual  Gaussians  go  to  zero.  It  is  also  related  to  the 
point-mass  method,  but  provides  a  computational  advantage  over  both.  This  method 
also  reduces  the  convolution  (1.1)  to  a  summation. 

The  alternative  approach  of  approximating  the  integral  followed  much  the  same 
path.  All  such  methods  amount  to  replacing  the  integral  with  a  weighted  sum  of  sanrj- 
ples  of  the  integrand.  The  particular  quadrature  formula  used  determines  the  locations 
of  the  samples  in  state  space  and  the  weights  used  in  the  summation.  The  basic  issue 
again  is  trading  off  computational  complexity  against  the  number  of  sample  points 
required  to  achieve  a  desired  accuracy.  Klein  [24-27]  in  particular  has  investigated  a 
number  of  different  quadrature  formulas. 

All  these  global  techniques  share  two  related  elements: 

1.  The  method  must  provide  a  procedure  for  defining  the  initial  and  subsequent 

grids.  For  computational  efficiency,  the  grid  should  be  redefined  at  each  time 
to  track  the  conditional  densities.  The  elements  of  the  grid  which  must  be 
specified  are  the  boundaries  of  the  grid  in  state  space,  and  the  number  and  dis¬ 
tribution  of  points  within  the  boundary.  The  distribution  of  the  points  will 
usually  depend  on  the  approximation  technique  that  will  be  used.  In  most  of 
these  techniques,  setting  up  the  grid  is  a  non-trivial  problem. 
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2.  Given  the  grid,  the  means  of  performing  the  Bayesian  update  must  be  specified. 

This  will  depend  on  the  particular  approximation  (either  function  or  integral) 
that  is  chosen.  In  almost  all  cases  (except  Gaussian  sum)  this  results  in  replac¬ 
ing  the  convolution  (1.2)  with  a  discrete  nonlinear  convolution.  It  is  this  opera¬ 
tion  which  tends  to  be  the  most  expensive  computationally  since  it  requires  on 
the  order  of  N*  operations,  where  N  is  the  number  of  grid  points. 

As  discussed  above,  the  direction  of  previous  research  has  been  determined  pri¬ 
marily  by  the  limitations  in  computing  power  that  existed  in  the  early  seventies. 
Recent  advances  in  computer  hardware  have  loosened  those  limitations.  It  is  now  rea¬ 
sonable  to  reduce  computational  complexity  at  the  expense  of  storage  requirements.  As 
memory  costs  continue  to  decline  and  computer  speeds  increase,  this  trade-off  becomes 
even  more  favorable.  This  change  in  the  computing  environment  motivated  the  work 
presented  here.  The  basic  approach  is  to  use  a  simple  approximation  to  the  conditional 
densities  in  order  to  obtain  a  fairly  simple  representation  of  the  recursion  (1.1)-(1.2). 

The  method  that  will  be  presented  in  this  dissertation  is  based  on  approximat¬ 
ing  the  conditional  densities  by  functions  piecewise  constant  on  regions  defined  by  a  reg¬ 
ular  grid.  This  has  several  advantages: 

1.  The  grid  can  be  described  in  a  particularly  simple  way;  the  region  of  interest  is 
filled  with  a  number  of  identical  multidimensional  polyhedrons.  This  results  in 
an  easy  grid  update. 

2.  The  density  approximation  has  a  simple  form  that  makes  such  operations  as 
taking  expectations  easy.  This  is  important  since  we  expect  the  density  to  be 
used  in  some  decision  making  process  which  usually  involves  taking  the 
expected  value  of  some  cost  functional. 

3.  The  convolution  (1.1)  can  be  written  as  a  discrete  linear  convolution  instead  of 
a  discrete  nonlinear  convolution.  This  allows  the  convolution  to  be  calculated 
using  FFTs  at  a  cost  proportional  to  N/n(N)  instead  of  N*. 

4.  The  error  in  approximating  the  conditional  densities  and  its  propagation 
through  the  recursion  can  be  analyzed  in  a  straightforward  way.  Although  the 
error  bounds  turn  out  to  not  be  extremely  tight,  the  analysis  does  point  out  the 
situations  in  which  lead  to  poor  performance. 
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Although  the  idea  of  using  piecewise  constant  approximations  has  been  men¬ 
tioned  briefly  before  28,,  it  was  not  pursued.  I'he  idea  of  manipulating  the  recursion  to 
achieve  a  linear  convolution  appears  to  be  new.  The  use  of  FFTs  for  evaluating  convo¬ 
lutions  is,  of  course,  not  new,  but  application  to  this  problem  is.  Finally,  the  propaga¬ 
tion  of  the  error  through  the  recursion  has  n<it  been  addressed  in  any  depth  before. 


3.  THE  ALGORITHM 


3.1.  Assumptions,  Notation,  and  Conventions 

Before  we  derive  the  proposed  algoritiiiii,  we  must  sp)ecialize  the  very  general 
model  introduced  in  Chapter  1,  and  introduce  some  notation  that  will  make  the  follow¬ 
ing  presentation  clearer.  We  will  restrict  our  attention  to  systems  modeled  by 

+  i  =  /*(**.“»)  +  w* 

2k  =-  (3-1) 

The  noise  sequences  {w^}  and  are  assumed  to  be  white  and  zero-mean,  to  be 

uncorrelated  with  the  state  and  with  each  other,  and  to  have  densities  with  finite  sup¬ 
port.  The  control  sequence  {ut}  is  either  a  priori  known  or  calculable  at  each  time  from 
only  the  available  observations,  which  implies  that  p(uj|  =  ij.Z*,,).  The 

system  dynamics  function  /*  and  the  observation  function  /i*  are  at  least  measurable 
functions. 

While  this  is  a  more  restrictive  model  than  that  of  Chapter  1,  it  is  still  quite 
general.  In  fact,  many  systems  which  at  first  might  not  appear  to  fit  these  assumptions 
can  be  modeled  by  (3.1).  For  instance,  colored  noises  can  be  included  by  appending  a 
subsystem  driven  by  white  noise,  whose  output  is  the  desired  colored  noise.  In  other 
cases,  this  model  can  provide  an  adequate  approximation.  As  an  example,  consider 
Gaussian  noise.  It  has  a  density  with  infinite  support.  An  adequate  approximation 
might  be  to  truncate  the  support  at,  say,  plus  and  minus  three  standard  deviations.  (Is 
anything  really  Gaussian  anyway?  Given  the  experimental  observation  that  most  ’real’ 
noises  are  thinner  in  the  tails  than  the  Gaussian,  the  above  might  actually  be  a  more 
appropriate  model  in  many  cases  than  assuming  Gaussian  noise.) 

Kestricting  ourselves  to  the  model  above  allows  us  to  simplify  the  expressions 
for  the  densities  of  interest.  Applying  elementary  rules  for  manipulating  densities,  as  we 
did  in  Chapter  1,  we  can  write  th?  e.quations  for  the  conditional  pr»‘diction  and  filtering 
densities  fur  the  model  (3.1)  as 

p(2k  \  /*)  C  '  p(zj|  Zt  i)  p(2ti  I*) 

C  J  p(zj|  Zt_,)  p(zj1  I*)  dx*  (3.2) 
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\4 


Following  standard  praciirt*.  niultiditnensional  iiiit^grals  as  above  are  understood  to 
mean 

f  dz  g[z)  -  f  dzf,,  J  J  •  >^(n)) 

wliere  i|,)  is  the  tih  rornpoinM  l  of  llu-  verlor  z.  I'ltless  oLiierwise  noted,  integrals  are 
assumed  to  be  over  the  entire  support  of  the  integrand. 

From  the  model  we  note  that 

where  ^  is  the  density  function  for  u»*.  Now,  for  notational  convenience,  we  define 
*lz)  -  )r*(x*)  =  p(i*|  2»-i) 

mIz.z)  -  Mklzt.zt)  -  p(2*(  I*) 

d>lz)  -  -  p(it|  if*)  (3,3) 

f(i)  =  r*(x)  -  p«,*(z) 

As  implied  in  the  above  definitions,  we  will  drop  the  explicit  dependence  on  the  time  k 
whenever  there  is  no  ambiguity.  To  help  minimize  confusion,  the  variable  k  will  be 
used  only  to  denote  the  time.  With  these  definitions,  the  recursion  (3.2)  becomes 


<p(z)  -  C  '  x*(r)  p(z,z) 

(3.4) 

'fi,i(i)  "  f  't>(y)  r(i  /(y.u))  dy 

(3.5) 

where  the  normalizing  constant  C  is  chosen  so  that  J  (p{z)dz  =  1. 

In  actually  computing  the  recursion,  we  will  use  approximations  to  the  above 
densities,  A  hat  will  be  used  to  signify  an  approximate  density,  and  a  tilde  to  signify 
the  associated  error  function.  Hence,  we  would  write  »f  (z)  =  7r  (i) -')r"(i).  We  will  be 
approximating  the  densities  on  a  regular  multidimensional  grid  in  state  space  defined  by 
the  points  i,  ^  Aj  where  j  is  an  n-veclor  with  integer  elements,  each  in  the  range  1  to 
M.  I'here  are,  therefore,  il-M“  total  grid  points.  The  iixn  matrix  A  and  the  n-vector 
6  are  chosen  so  that  the  grid  covers  some  region  of  interest.  Each  point  x,  of  the  grid  is 
the  center  of  a  cell  in  R"  defined  by 


{z  -V  ;  y(  lo) 


where  Iq  is  the  generic  cell  given  by 

The  volume  of  each  cell,  which  we  will  call  a,  is  given  by  det(^  )]  .  Note  that  the  only 
restriction  on  A  is  that  det(,-l  )?^0.  Thus  we  are  not  restricted  to  rectangular  grids,  and 
so  have  increased  flexibility  in  tailoring  the  grid  to  the  densities.  Figure  3.1  shows  two 
examples  for  n  =  2  and  Observe  also  that  _  Ij  covers  the  entire  region  of  interest 

I 

(by  definition  of  the  grid),  and  that  I,'  1,=0  for  rV  j- 


Finally,  we  note  that  we  will  be  using  multidimensional  summations  in  the  work 
ahead.  They  are  exactly  analogous  to  integrals,  .so 

M  M 

S  V, 

Like  integrals,  the  sum  is  assumed  to  be  over  the  entire  grid  when  no  limits  are  speci¬ 
fied. 

3.2.  Derivation  of  the  Algorithm 

As  discussed  in  the  introduction,  the  difficulty  with  the  recursion  we  have 
derived  for  calculating  the  conditional  prediction  density  is  that  there  is  generally  no 
closed  form  solution.  This  being  the  case,  we  are  forced  to  look  for  some  means  of 
approximating  the  recursion.  The  approach  here  is  to  approximate  the  prediction  den¬ 
sity  by  a  piecewise  constant  function,  and  then  apply  (3.4)  and  (3.5)  to  recursively 
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update  the  approximate  prediction  density.  In  deriving  the  algorithm,  however,  we  will 
start  with  and  work  backward. 

Hegin  by  considering  a  slightly  modified  version  of  equation  (3.5).  If  we  define 
y  =  /(z,u),  then  the  state  update  equation  becomes 

=  y  +  W), 


and  then  (3.5)  becomes 


x{i)  - 


J  Hv)  f 


(i-y)  dy 


where  Vlv)  is  the  conditional  density  function  p(y\  Z^)  and  S(^)  is  the  support  of  V' 
Notice  that  (3.6)  is  a  linear  convolution.  Now  assume  that  we  have  defined  a  grid  that 
covers  the  support  of  v.  (We  will  discuss  how  to  get  such  a  grid  a  bit  later.)  The  best 
L,  or  L]  constant  approximation  to  a  function  over  a  region  is  the  average  of  the  func* 
tion  over  that  region.  Hence,  the  best  piecewise  constant  approximation  to  n  on  the 
given  grid  is  / (z)  =  ir ,,  z€  1^,  where 


-  ^  j  n  (z)  dz  =  J  dz  J'  dy  V'(y)  r(z-v) 


with  the  last  equality  due  to  (3.6).  Since  Wi  is  a  zero-mean  sequence,  we  know  that  the 
origin  is  contained  in  5(r),  the  support  of  r.  This  implies  that  5(^)c  5()r),  so  the 
assumed  grid  also  covers  the  support  of  We  can  therefore  use  the  covering  property 
of  the  grid  to  break  up  the  inner  integral  in  (3.7)  to  get 

^  dz  J]  ^  dy  V'(v)  r{z-y)  =  J  dx  Jdy  V’(y)  r(z-y) 

The  interchange  of  summation  and  integration  is  justified  since  it  is  a  finite  sum.  If  we 
now  now  replace  tp(y)  with  a  piecewise  approximation  V’(y)=V’/5  1/6  1,,  we  get 


a;  ^  j  dx  J  dy  r(z  -  y) 


Consider  the  double  integral  above;  using  the  definition  of  the  grid,  and  with  a  change 
of  variable,  we  get 


—  J  dx  dy  r(z-y)  ~  ^  J  J  r(y4  (i-y)+rfi-rfj) 


Note  that  the  result  is  a  function  only  of  the  difference  (i-;);  this  is  crucial  to  obtaining 
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a  discrete  linear  convolution.  If  we  call  the  result  of  (3.9)  r,  ,,  then  we  get  from  (3.8) 

SV-,  r._,  (3.10) 

I 

We  note  this  is  a  discrete  linear  convolution.  Equation  (3.9)  is  also  the  expression  for  a 
limes  the  best  piecewise  approximation  to  r  over  the  region  I,*!;,  so  in  a  sense  we  have 
replaced  both  and  r  with  their  piecewise  constant  approximations  in  getting  to  equa¬ 
tion  (3.10). 

To  get  to  equation  (3.8)  we  introduced  the  piecewise  constant  approximation  to 
V*.  As  with  ar  we  define  t^(v)^V’.»  where 

=  S  J  ^(„)  dy 

Recall  that  V'(v)  is  the  conditional  density  of  v  =  f(t,u).  Using  this,  and  suppressing  the 
conditioning  in  the  notation,  we  can  rewrite  the  integral  as 

rv,(y)  ily  ^  Prob(v€  1.)  ^  Prob(x€  H,  ;H,=  {i:/(z,u)6  1,})  =  f^{z)dz  (3.11) 


As  long  as  /  is  measurable,  H,  is  a  measurable  set  (since  it  is  the  inverse  image  of  an 
open  setj,  and  so  the  integral  is  well  defined.  Now  suppose  that  we  have  the  piecewise 
constant  approximation  *  i(z)  from  the  last  time  step,  defined  by  »4(*)^xt,,,  *6  1*,,, 
with  It ,  based  on  the  grid  from  the  last  iteration.  Using  this,  and  the  definition  of  ^ 
from  (3.4),  in  (3.11),  we  get 

fd>li)dz=  fc' xt(x)/4(z,z)  dz  ~  5]  r  7ft(z)/i(z,z)  dz 

In  the  above  we  used  the  fact  that  the  old  grid  covers  the  support  of  z't,  just  as  the 
current  grid  covers  the  support  of  the  current  if.  This  now  allows  us  to  write  v*.  in 
terms  of  the  previous  prediction  density  and  the  measurement  update  density  as 


(3.12) 


The  integral  term  above  is  the  constant  approximation  of  over  the  region  H.nl^  j. 
Hence,  as  with  (3.10),  we  have  effectively  replaced  both  densities  with  their  approxima¬ 
tions. 


Tht*  constant  C  in  (3.12)  is  to  be  chosen  so  that  the  densities  are  properly  nor¬ 
malized,  that  is,  so  that  they  integrate  to  one.  Since  we  are  primarily  interested  in  the 
prediction  density  approximation  jf,  it  is  convenient  to  move  the  constant  from  (3.12) 
to  (3.10)  and  leave  unnormalized.  Applying  the  normalization  requirerrient  to  if  and 
using  the  modified  form  of  (3.10)  gives 

J  7r'(z)  dz  -  a  5]  IT  ,  =  f»  6’  '  5]  S  V’,  =  1 

*  •  ; 


Therefore,  we  must  have 


c  O  £  5]  V'j  r.-i  -  “  S  V-y  S  *‘•-1  =  “  S  v*, 

<  J  t  •  1 

(3.13) 

We  now  have  the  basic  relations  for  recursively  calculating  the  approximation 
to  the  conditional  prediction  density.  Combining  (3.9),  (3.13),  and  the  modified  ver¬ 
sions  of  (3.10)  and  (3.12),  we  get 

Hj  =  {z  :  /*(z,u)e  ly} 

(3.14a) 

V',  =  —  S  »*.,  f  /**(*.«)  dz 
“  <  h/i*,. 

(3.14b) 

r,_y  =  Q  r  diJ,  r  d6jTi,{A(i-j)+Si-S2) 

Iq  *0 

(3.14c) 

C  -aY.'l’, 

J 

(3.14d) 

Ml,.  =  C*'  5]  V'y  r,-, 

(3.14e) 

It  should  be  pointed  out  that  the  filtering  density  ^  is  not  calculated  explicitly 
in  the  recursion  above.  If  for  some  reason  4>  *s  needed,  we  can  readily  calculate  the 
piecewise  constant  approximation  <^(i)=<^i,  ze  lk,i,  by 

<p, - —  f  C~‘  if(r)  dx 

f!.. 

C  '  71  t  ,  —  f  dt  =  C  '  11*,,  /i, 

C  is  chosen  so  that  ^  will  be  a  normalized  density,  so 
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If  we  have  calculated  0^  we  can  modify  (3.14b)  to  save  some  computation  (but  with 
some  loss  in  accuracy)  by  substituting  ^  for  0  in  (3.11).  Since  is  piecewise  constant 
on  the  old  grid,  we  get 

V,  =-  —  Y^<P,  f  dx 

If  we  choose  to  do  this,  V’  will  be  normalized,  so  we  can  drop  (3.14d)  and  remove  C  * 
from  (3.14e). 

S.S.  Interpretation  aa  a  Probability  Maaa  Filter 

It  is  interesting  to  note  that  the  recursion  described  by  (3.14)  can  also  be 
derived  from  a  rather  different  point  of  view.  Based  on  this  alternate  approach,  the 
recursion  can  be  interpreted  as  updating  the  probability  mass  associated  with  each  ele¬ 
ment  of  the  grid,  as  opposed  to  updating  the  probability  density. 

We  begin  as  in  section  3.2  by  letting  i/  =  /(z,u),  and  assuming  that  we  have  a 
grid  that  covers  the  region  of  interest  in  state  space.  The  conditional  probability  that 
+  ,  is  in  I,  is  given  by 

Prob(x*^,eI,)  =  E  Prob(v€l,)  Prob(x*„€  1. 1  y€  1,)  (3.15) 

I 

The  conditioning  on  the  measurements  has  been  omitted  from  the  notation  for  clarity. 
Recalling  equations  (3.11)  and  (3.12), 

Prob(ytl,)  =  C~*  ^  i(x)  //(x,x)  dz  (3.16) 

where  11^  is  the  inverse  image  of  1,  as  before.  Suppose  that  we  have  the  set  of  probabili¬ 
ties  Prob(ije  Ij ,)  from  the  last  time  step.  If  we  make  the  approximation  that  the  mass 
is  uniformly  distributed  within  each  region,  then 

jt^It)  -  — Prob(z*e  Ij,.) 

Making  this  substitution  in  (3.16)  gives 

Prob(i,c  IJ  =  f7-'  E  —  Prob(z*e  1*,.)  f  ^{z,z)  dz  (3.17) 

-  h/i*. 

ihe  second  factor  in  the  sum  in  (3.15)  can  be  written  as 
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Prob(it^,e  1,1  yfc  ly)  = 


v)  p(v) 


This  lime  we  make  the  approxirnaiion  that  the  mass  for  y  is  uniformly  distributed  over 
each  region,  so  p(|/)  is  constant  over  1^.  This,  combined  with  the  definition  of  r,  gives 

dx  J'  dy  t{i  -  y  I 

Equations  (3.15),  (3.17),  and  (3.18)  give  an  approximate  recursion  for  calculating  the 
probability  masses  on  each  interval. 

The  equivalence  between  this  recursion  and  (3.14)  is  obvious  given  the  following 
observations: 


Prob(it^,t  1,1  yt  IJ  -  ^ 


Prob(z*€  1*,.)  =  y 
Prob(y€  IJ  -  a  C'*  V*/ 

Prob(z6  M  ye  ly)  =  r,_y 

Making  these  substitutions  in  (3.15),  (3.17),  and  (3.18)  produces 

V-y  —  E  ^  k.,  f  /*(*.«) 

’r  C-‘ SV-,  r..y  (3.20) 

) 

The  constant  C  is  obtained  by  applying  the  normalization  constraint,  which  gives 

c=T.^.- 

t 

The  recursion  derived  in  this  section  (equations  (3.20))  is  identical  to  that  of 
(3.14).  We  conclude  that  the  recursion  can  be  interpreted  either  as  an  update  of  the 
approximate  probability  density  functions,  or  as  an  approximate  update  of  the  probabil¬ 
ity  masses  for  each  region  in  the  grid. 


3.4.  Implementation 

Several  aspects  of  the  implementation  of  the  recursion  defined  by  (3.14)  need  to 
be  discussed  before  we  consider  any  specific  examples. 


21 


As  the  basic  equations  stand,  there  is  little  to  recommend  diem  computationally 
over  any  other  approach,  particularly  since  they  are  based  on  a  fairly  crude  approxima¬ 
tion,  and  therefore  may  require  many  points  to  get  a  given  accuracy.  The  main  culprit 
is  (3.14e)  which  is  a  discrete  convolution  of  the  two  sequences  {V’,}  and  {r,}.  Direct 
computation  of  this  convolution  requires  on  the  order  of  N*  operations,  where  N  is  the 
total  number  of  points  in  the  grid.  It  is  not  necessary,  however,  to  compute  it  directly. 
Since  it  is  a  linear  convolution,  we  can  use  transform  techniques.  Specifically,  we  can 
take  the  product  of  the  discrete  Fourier  transforms  (DFT)  of  the  sequences,  and  take 
the  inverse  transform  of  the  result.  Using  standard  FFT  techniques,  a  DFT  requires  on 
the  order  of  N/n(N)  operations.  The  FFT  approach,  then,  requires  on  the  order  of 
(SN/n(N)  +  N)  operations  (two  forward  DFTs,  one  multiplication,  and  one  inverse  DFT). 
It  does  not  take  a  very  large  N  to  get  substantial  savings.  We  can  also  use  the  consider¬ 
able  work  that  has  already  been  done  in  developing  highly  efficient  FFT  algorithms  ((29) 
and  [30]  are  recent  examples)  to  further  increase  the  advantage.  Since  FFT  algorithms 
tend  to  be  highly  parallel,  it  is  also  possible  to  take  advantage  of  various  computer 
hardware  such  as  parallel  and  array  processors. 

There  is  potential  danger  in  the  FFT  approach,  though,  since  the  FFT  convolu¬ 
tion  is  circular,  and  hence  may  result  in  aliasing.  To  avoid  this,  the  grid  must  be  large 
enough  to  completely  cover  the  support  of  t .  We  assumed  this  at  the  beginning  of  sec¬ 
tion  3.2,  so  this  requirement  does  not  affect  the  algorithm  directly.  But  since  both  (V',} 
and  {r,}  are  defined  on  this  same  grid,  the  above  implies  that  portions  of  both  sequences 
will  necessarily  be  iero.  This  in  turn  indicates  there  is  a  certain  amount  of  waste  in 
both  storing  and  performing  computations  on  the  zero  elements.  It  may  be  advisable  to 
use  dynamic  storage  techniques  to  only  store  the  non-zero  elements,  and  customize  the 
FF'P  algorithm  to  take  advantage  of  the  fact  that  only  the  middle  portion  of  these 
sequences  is  non-zero. 

The  FFT  approach  also  provides  the  normalizing  constant  C  given  by  (3.14d). 
In  many  cases,  this  normalization  is  not  really  necessary,  since  we  lose  no  information  if 
we  choose  not  to  normalize.  Dropping  C  would  save  computation  at  a  slight  cost  of 
ease  in  interpreting  the  results.  Some  care  must  be  taken  to  insure  that  the  unnormal¬ 
ized  density  does  not  exceed  the  numerical  limits  of  the  computer,  though.  It  is  quite 
possible  for  it  to  go  to  zero  everywhere  as  far  as  the  computer  is  concerned.  In  any 
case,  we  get  C  automatically,  and  can  then  normalize  at  the  cost  of  only  N  multiplies. 
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We  get  C  as  follows.  Let  {V'  }  be  the  transform  sequence  of  {yi}.  From  the  derinition  of 
the  DFT, 

V'.  -  2]  •‘*p( - ■  ^  ^  - ‘-I 

so  Using  this  in  (3.14d)  gives  Since  we  have  calculated  the  transform 

I 

of  V'l  ^  is  readily  available.  The  normalizing  constant  can  be  included  when  the 
transform  sequences  are  multiplied  together,  or  n  can  be  normalized  after  the  inverse 
transform,  or  it  can  be  left  unnormalized  and  C  carried  along  to  be  used  if  needed. 

Next,  note  that  (3.14b)  and  (3.14c}  require  evaluating  integrals.  Two  factors 
make  this  less  troublesome  than  it  seems.  First,  the  integrands  are  a  priori  given  func¬ 
tions,  not  involving  the  approximated  prediction  density.  Equation  (3.14c)  is  particu¬ 
larly  straight-forward  since  it  is  over  a  single  well-defined  region.  Second,  as  alluded  to 
in  section  3.3,  the  integrals  can  generally  be  evaluated  in  terms  of  the  noise  distribution 
functions.  Thus  we  are  trading  evaluations  of  the  density  function  for  evaluations  of 
the  distribution  function.  For  the  many  cases  where  the  distribution  function  is  known 
in  closed  form,  this  is  no  additional  work.  In  the  cases  where  the  noise  characteristics 
have  been  found  experimentally,  the  distribution  function  is  generally  more  accurately 
estimated  than  the  density.  In  some  cases  then,  this  method  may  actually  be  easier  and 
more  accurate.  If  the  distribution  functions  are  not  available  in  closed  form  and  numer¬ 
ical  integration  is  used,  however,  these  equations  may  represent  a  large  computational 
burden. 

Equation  (3.14b)  has  another  nasty  feature  in  addition  to  the  integral.  As  writ¬ 
ten,  it  is  a  convolution  which  would  require  on  the  order  of  operations  in  addition  to 
the  integrals.  In  general,  this  could  be  true.  In  practice,  the  system  dynamics  will 
probably  be  well-behaved  enough  that  most  of  the  sets  H^n  I,  will  be  empty  for  any 
given  j.  Since  we  only  have  to  sum  over  the  non-empty  intersections,  this  can  be  a  sig¬ 
nificant  reduction.  The  regularly  shaped  grid  regions  I,  make  it  fairly  easy  to  both 
determine  which  regions  will  intersect,  and  describe  the  intersection,  even  though  the 
inver.se  image  sets  11,  may  be  strangely  shaped. 

Specification  of  the  inverse  image  set  defined  by  (3.14a)  can  be  the  most  alge¬ 
braically  difficult  part  of  implementing  this  algorithm.  H,  is  usually  described  by  giving 
the  equations  of  its  sides,  each  side  being  the  inverse  image  of  a  side  of  the  original  grid 
region.  Depending  on  the  form  of  /,  these  can  be  awkward  equations.  If  /  is  one  to 


one,  H,  will  be  a  single  region.  If  /  is  many  to  one  though,  it  will  likely  be  several 
disconnected  regions,  and  so  the  integral  in  (3.1  “lb)  will  be  the  sum  of  integrals  over 
those  regions.  Fortunately,  the  problems  with  sjH'cifying  H,  typically  occur  while  coding 
the  algorithm  for  a  particular  system,  and  are  not  a  major  computational  burden  when 
running  it.  We  will  see  some  examples  of  calculating  H,  and  H,n  1,  in  the  applications 
sections  ahead. 

The  recursion  of  (3.14)  does  not  specify  how  to  select  tlie  new  grid  at  each  time 
step,  other  than  to  require  that  it  covers  the  support  of  the  new  prediction  density. 
The  support  can  be  calculated  before  n  itself  is  as  follows.  Let  S(n^)  be  the  support 
and  be  the  support  of  fi.  Then  the  support  of  4>  is  =  since  ^  is 

the  product  of  and  /i.  The  support  of  V'l  ^'(0))  is  the  smallest  region  containing  the 
image  of  5(^)  under  the  system  dynamics  function  /.  Finally,  the  support  of  is 

=  {z  =  y +2  ;  ye  ze  5(r)} 

The  grid  can  then  be  chosen  in  any  convenient  way,  as  long  as  it  completely  covers 
5(z  It  is  worth  pointing  out  that  the  grids  at  each  time  may  be  completely 

independent,  even  to  containing  different  numbers  of  points. 

Note  that  the  preceding  depends  on  the  assumption  made  earlier  that  the  noise 
densities  have  finite  support.  There  are  a  number  of  commonly  used  noise  densities 
which  have  infinite  support.  In  those  cases  it  is  necessary  to  truncate  the  densities  at 
some  point  to  achieve  finite  support.  Done  judiciously,  this  truncation  will  have  negligi¬ 
ble  effect. 

To  initialize  the  recursion,  we  need  to  specify  an  initial  grid  and  prediction  den¬ 
sity  approximation.  The  initial  grid  is  chosen  to  cover  the  support  of  the  given  initial 
prediction  density.  The  density  approximation  is  then  =  o.d  z€lo.,, 

»oi  =  ■“  r  di,  where  is  the  initial  prediction  density. 

a 

*0,1 

The  last  point  to  be  discussed  is  the  selection  of  the  number  of  grid  points  along 
each  axis  of  the  grid  system.  This  number  must  be  large  enough  to  provide  sufficient 
accuracy  of  the  approximations,  but  should  be  as  small  as  possible  to  minimize  the  com¬ 
putational  burden.  There  may  also  be  some  restrictions  on  the  number  of  grid  points 
due  to  the  FFT  algorithm  chosen,  although  these  are  usually  minor.  Unfortunately, 
there  are  no  firm  guidelines  to  help  here.  Probably  the  best  approach  is  to  start  with  a 
small  number  and  increase  it  until  you  get  reasonable  accuracy  for  the  particular 
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application.  As  we  will  see  in  the  numerical  examples  in  chapter  4,  we  can  get  surpris¬ 
ingly  good  results  with  relatively  few  points. 

3.5.  Analysis  of  Error  Propagation 

For  classical  point  estimation  techniques  the  major  concern  is  whether  the  point 
estimate  converges  to  the  true  value,  and  if  so,  how  fast.  Here,  since  we  are  actually 
calculating  an  approximation  to  the  posterior  density,  the  question  is  instead  whether 
the  density  approximation  diverges  from  the  true  density.  In  other  words,  we  are  con¬ 
cerned  with  bounding  the  maximum  growth  of  the  error  instead  of  finding  the  minimum 
shrinkage. 

We  turn  now  to  characterizing  the  error  that  accumulates  in  the  prediction 
density  approximation  as  we  execute  the  recursion  (3.14).  As  we  will  show,  there  is  an 
upper  bound  on  the  growth  of  the  error  from  step  to  step.  This  bound  is  the  best  possi¬ 
ble,  in  the  sense  that  there  are  pathological  situations  in  which  the  bound  is  achieved. 
It  is  not,  however,  what  one  would  call  reassuringly  tight,  since  it  allows  fairly  rapid 
error  growth.  Fortunately,  in  deriving  the  bound  we  are  also  able  to  characterize  the 
situations  which  lead  to  large  error  growth,  so  that  we  can  say  that  generally  the  error 
growth  will  be  much  less  than  its  bound. 

For  this  section  we  will  make  two  simplifying  assumptions  to  help  decrease  the 
notational  clutter.  All  the  conclusions  also  apply  in  the  general  ceise.  First,  we  restrict 
our  attention  to  the  scalar  case.  Among  other  things,  this  allows  us  to  use  the  represen¬ 
tation  i=Q{i+i)+/9,  Si  (-,),  for  i6l,,  where  the  scalars  q  and  P  define  the  current  grid. 
Second,  we  assume  the  system  dynamics  function  /  is  invertible.  With  this  assumption, 
we  can  write 

V-lv)  =  »*(/  '(v))  ?(v) 

where  j(i/)  -|  dj  '(y)ldy  \  .  Note  that  V  is  not  normalized,  so  that  it  corresponds  to  the 
earlier  derivation. 

We  begin  by  assuming  that  we  have  7r"t(z),  the  prediction  density  approxiina- 
tion  at  time  k.  The  error  is  x-jr-jf,  and  in  practice  is  unknown  since  we  would  not 
know  X.  Note,  however,  that 

J'  if(z)  dz  =  J"  x(z)  dz  -  J’  x'li)  dz  =  1  -  1  =  0 
Hence,  we  do  know  that  x  oscillates  about  zero,  since  its  average  over  its  support  is 
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zero.  Let  v  be  bounded  by  ;  ift(z)i  ^ 

Now  consider  yi.  It  is  approximated  by  v'(v)“V';!  Vf  V-’i  defined  by  (3.14b). 
For  yt  1,  we  gel 

V'(v)  vly)  -  v-,  “  (x‘  'ly).*)9(y)  -  v-, 

^  » (/■‘(»))/‘(/'*(y).2)y(y) 

+  » (/“‘(y))/'{/''(y).‘)y(y)  -  —  I  *  (/  ‘(y))M(/"‘(y)>2)y(y)‘<y 

Define 

^.(y)^  »{/  '(y))M(/  ‘(y).^)y(y)  (3.21) 

^.(y)  » (/"'(y))>‘(/'‘(y).^)y(y)  -  » (/''(y))M(/  '(y)>^)y(y)<^y  (3.22) 

The  term  is  the  ‘carryover’  error.  It  is  the  error  in  ip  from  using  the  density  approxi¬ 
mation  if  instead  of  the  actual  density  n .  From  the  assumed  bound  on  x,  we  get 

I  ^,(y)l  =  I  »  M  y I  ^  ”  t  ^  9  =  ‘t4  \<'(y)  (3.23) 

The  other  term,  is  the  ‘approximation’  error.  It  is  the  error  which  would  result 
from  approximating  V'  by  ^  if  if  were  the  true  density.  This  error  is  (conceptually  at 
least)  calculable,  since  if  and  p  are  known.  Suppose,  then,  that 

l^^.(y)l  ^  ‘y.  V’(y)  (3.24) 

Since  we  are  approximating  with  a  piecewise  constant  function,  we  expect  that  ^  will  be 
a  worse  approximation  (t^  will  be  greater)  as  tp  becomes  steeper. 

It  is  worth  emphasizing  that  ^nd  V'o  arise  from  distinct  sources.  If  k  i,=if  i,, 
so  that  we  had  no  error  in  the  last  prediction  density,  would  be  zero,  but  would 
be  unaffected.  Likewise,  ip,  is  unaffected  by  the  quality  of  the  approximation  of  V  as 
reflected  by  ip„.  Two  other  features  of  these  error  terms  should  be  noted;  first, 
^V'o(y)‘fy  0,  and  second,  even  though  (z)dz  0,  it  is  very  likely  that  0. 

Hence,  ip„  locally  averages  to  zero,  while  ip,  is  not  guaranteed  to  average  to  zero  even 
over  its  entire  support. 

Next,  substitute  -t  in  the  convolution  for  We  get 


26 


=-  C'  ‘  f  vJj/)  +  vMy))  ’•(r  y)  dy 

(3.25) 

Note  llie  iiilegral  in  the  first  lerin  above.  Its  piecewise  constant  approximation  for  re  1, 
isr,_j  given  by  (3.14c).  Therefore  define 

f,  ^  »■((“(«  -^)-^|d)-y)  dy  r..j 

As  with  this  error  is  calculable  since  r  is  a  known  function.  Accordingly,  2issume 

I  ^.-,(<5)1  ^  J  »■((«(»  ^^) +/^)- y) ‘fy 


-  c 


1  Jv’a(y)^(^  -  y)dy  ^  f^’c(y) 


r{x- y)dy 


Notice  that  J  r,(d)d6^0,  so  f  also  locally  averages  to  zero.  Substituting  the  above  into 
(3.25)  yields 

»^*+i(*)  =  +  5]V’/^.-,('5)  -t  J'^.(y)t(i-y)dy 


+  f^,(y)’’lf-y)dy 


(3.26) 


The  normalizing  constant  C  is  obtained  by  solving  J' n(z)dz  =  l,  which  results  in 

■  f  .»  ^  SV'/.-;(^)  +  J^a(y)’'(z-y)<fy  +  J'^.(y)»'(*-y)<fy | 

^  a  X)  V-,  ^  f^t(z)fi(z,2)dz  (3.27) 


where  we  have  used  the  fact  that  f  and  integrate  to  zero.  Refer  back  to  equations 
(3.l4d)  and  (3.l4e).  If  we  let  C  be  defined  by  (3.l4d)  and  substitute  this  and  equation 
(3.l4e)  into  (3.26)  and  (3.27),  we  get 


'*ti(*)  =  ,(i)  -f  f^'a{y)Tlz-y)dy  +  Jt^,(y)r(r-  yjdy  I 


From  this  it  is  obvious  that  the  error  functions  are 
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^  "  f  t(') 

■”  *-.i(*)  |e^'j^  .(('^)  / V’«(y)'^(i  y)'^y 

1  f^fly)r(x  y)dy  I 


(3.28) 


(3.29) 


We  will  examine  each  term  in  turn. 

First  consider  C  from  (3.28).  Using  the  assumed  bound  on  we  gel 

I  $  J|  x*(i)|  /i(r,2)di  ^  ^{x)n(x,2)dz  =  f  Jtl»(y)dy  =  t  , 

The  bound  on  this  error  is  only  attained  in  the  unlikely  event  that  fi  is  concentrated  on 
a  set  where  attains  one  of  its  extremes.  At  the  other  extreme,  if  /i  is  constant  over 
the  support  of  x*,  C=0,  since  Generally,  we  would  expect  C  to  be  small 

when  fi  is  spread  out,  while  a  more  concentrated  ft  would  increase  the  likelihood  of  a 
larger  error. 

The  first  term  of  (3.29)  is  the  normalization  error  in  n  due  to  the  error  in  the 
normalizing  constant  C.  If  C  is  small,  this  term  will  likewise  be  small.  In  any  case,  this 
is  a  fairly  benign  error,  since  it  is  a  simple  scaling  error  and  affects  all  the  intervals 
equally. 

Next  consider  the  second  term  of  (3.29).  Using  the  assumed  bounds  on  f, 
and  tj/t ,  this  term  can  be  bounded  by 


r((a(i+<5)+/3)  V)  dy 


=  r((a(«+<5)  t/9)  y)  dy 

)/V'(y)  r  ((a(i -t(5)  +  y )  dy  -  tr(l+<^^+t,j)x  t  +  ,(i) 


'I'lie  first  inequality  is  achieved  only  if  |  f  |  reaches  a  maximum  at  the  same  6  for  all  «; 
otherwise,  the  error  is  strictly  less  than  its  bound.  The  second  inequality  defiends  on 
tin-  behavior  of  the  convolution  of  and  with  r,  and  will  be  discussed  when  we  get 
to  the  third  and  fourth  terms  of  the  error  equation.  Note  that  this  term  is  analogous  to 
v>a.  It  is  the  error  which  would  result  from  approximating  x  with  x"  if  0  were  the  true 
density  (V'=V')-  Hence  we  would  expect  it  to  depend  on  how  steep  x  is.  We  can  investi¬ 
gate  this  further  by  making  the  rough  approximation  that  fr{x-y)dy=T,_j  at  x  =  ai+fi, 
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and  is  linear  between.  I'his  implies 

j(<^)  (O-y) 

Using  this  approximalit>n  gives 

j  j 

So,  roughly  speaking,  this  term  depends  directly  on  the  rale  of  change  of  if  (and  hence 
71 ),  as  We  would  intuitively  expect. 

Now  look  at  the  third  term.  This  is  the  propagation  of  the  approximation  error 
in  ill  through  the  lime  update  convolution.  With  the  bound  on  Hi,  from  (3.24),  we  have 

I  f^a(u}r{i'V}dpl  ^  *-n(*) 

Recall  that  ^V’«(v)dv ^0.  Hence  if  r  is  reasonably  constant  over  intervals  of  length  q, 

then  this  term  will  be  near  zero.  Another  way  to  look  at  this  is  to  recognize  that  the 
convolution  is  an  averaging  operation.  Since  locally  averages  to  zero,  we  would 
expect  the  convolution  to  be  near  zero  as  long  as  r  is  locally  nearly  constant. 

Lastly,  consider  the  fourth  term  of  (3.29).  This  term  results  from  the  propaga¬ 
tion  through  the  current  cycle  of  the  recursion  of  the  error  in  the  prediction  density 
from  the  previous  time  step.  Using  the  assumed  bound  on  the  previous  prediction  den¬ 
sity  error  x*,  and  the  same  steps  as  in  the  above  paragraph,  we  obtain 

I  J'V',(y)r(i'-v)dy  1  sj  J|  »*(/  ‘(y))l /r(/''(y).^)9(y)'^(*-y)‘^y 

‘r,J'’r»(/  ‘(i/))/i(/‘‘(i,),r)</(;,)r(i  y)dy  -  «  ..Cx  *  +  ,(2) 

The  behavior  of  this  term  is  much  the  same  as  the  last,  except  that  we  require  the  pro¬ 
duct  /ijr  to  be  reasonably  constant  over  the  whole  support  of  x*,  instead  of  just  locally. 
This  is  because  we  can  only  assert  that  the  entire  integral  of  jf*  is  zero,  rather  than  the 
stronger  statement  that  the  integral  over  intervals  of  the  grid  is  zero.  Clearly,  this  is 
more  difficult  requirement  to  tnect.  Even  so,  if  //  and  r  are  not  excessively  concen¬ 
trated,  we  can  expect  this  term  to  be  significantly  less  than  its  bound.  This  behavior  is 
important,  since  it  implies  that  we  can  expect  past  errors  to  be  forgotten  to  some 
extent.  This  in  turn  indicates  there  is  a  possibility  of  reaching  a  steady  state  error  level, 
where  the  error  introduced  at  eiich  iteration  is  roughly  balanced  by  the  attenuation  of 
past  errors. 
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Ttif  total  error  bound  is  obtained  b>  toinbining  all  the  bounds  for  the  indivi¬ 
dual  terms  in  (3.29).  1'his  produces 


*.  ‘  'i)  "  *  ’  i(^) 


Thus,  the  growth  of  the  error  for  each  iteration  is  bounded  in  this  sense.  Lnfortunately , 
this  is  not  a  particularly  restrictive  bound.  For  example,  a  relatively  small  5%  error  in 
each  of  the  constituent  terms  allows  nearly  22%  error  for  the  total. 

The  above  covers  the  behavior  of  the  error  in  the  density  approximation.  In 
light  of  the  interpretation  of  the  recursion  as  a  mass  fdter,  it  is  also  useful  to  consider 
the  behavior  of  the  error  in  the  mass  assigned  to  each  interval.  Let 
-  Frob(it ,  jt  1,1  Zj).  Then 


,)(i) 


The  approximate  probability  P,  generated  by  the  algorithm  is  given  by 


(i)  dz 


u”  k,i., 


lienee  the  error  is 

P.  -  P.  P.^  -  ’r-,„(z})  dz  - 

Using  (3.29)  to  substitute  for  for  le  I,,  we  get 


P. 


I 


C  ,  ,  1 


y)dy  Mi 


ll‘ 


(3.30) 


d'ln'  ititegral  of  course  distributes  over  the  sum.  'I'he  first  term  is  simply  a  constant 
times  .  'I'he  second  term  is  zero  since  the  integral  over  any  interval  of  f,.,  is  zero. 
'I'he  third  term  can  be  bounded  by 


J dz  fd]/^. 


[»)Mi"y)l 


Similarly,  the  fourth  term  is  bounded  by 


Jdzfdy 


0a(v)»'U-V)! 


Hence  the  total  error  in  the  approximation  to  the  probability  mciss  is  bounded  by 


lAl  ^ 


Note  that  this  bound  depends  on  the  error  in  the  density  approximation  from  the  previ¬ 
ous  step.  Hence,  although  the  error  in  the  mass  approximation  is  less  than  the  error  in 
the  density  approximation  at  each  step,  it  can  grow  as  rapidly. 

In  addition  to  the  errors  in  the  density  and  mass  approximations,  we  may  also 
be  concerned  with  the  error  in  the  ultimate  decision  from  using  the  approximate  density 
instead  of  the  true  density.  In  general,  the  decision  will  depend  on  the  expected  value 
V  of  some  cost  functional  v.  Following  the  approach  used  above,  we  see  immediately 
that  the  error  in  using  the  approximation  to  V  is  given  by 

^  =  J  »*+i(*)  «'(*)  dx 
If  0,  the  error  can  be  bounded  by 

I  F|  <  J*  I  x*+,(i)|  «(i)  dz  $  t,  J  Jr*  +  i(z)  v(i)  rfi  =  <,  V 

where  e,  is  the  bound  for  the  error  in  the  current  density  approximation.  Recall  though 
that  J'rfj,,(z)dz=0,  so  if  the  cost  functional  is  not  overly  concentrated,  we  can  expect 
the  error  to  be  signiHcantly  less  than  the  bound.  Clearly  if  v  happens  to  be  concen¬ 
trated  in  the  wrong  places,  then  the  error  could  approach  the  bound,  but  this  would  be 
unlikely.  Once  again  we  arrive  at  the  intuitively  reasonable  characterization  that  the 
errors  will  be  aggravated  by  sharply  spiked  functions.  If  v  takes  on  both  positive  and 
negative  values,  we  cannot  bound  the  error  in  this  way. 

As  we  have  seen,  the  error  bound  for  the  recursion  can  grow  at  an  uncomfort¬ 
able  rate.  Fortunately,  the  conditions  when  the  errors  reach  their  bounds  are  fairly 
unusual.  In  general  terms,  as  lung  as  the  densities  remain  reasonably  evenly  spread  out, 
the  errors  should  be  substantially  below  the  bounds.  This  matches  with  our  intuitive 
expectation,  since  the  piecewise  constant  approximations  that  are  being  used  are  best 
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when  the  densities  are  reasonably  flat  in  the  scale  of  the  grid.  This  in  turn  implies  that 
we  can  reduce  the  error  by  increasing  the  number  of  grid  points.  In  fact,  the  piecewise 
constant  approxin.ations  converge  uniformly  for  the  class  of  continuous  densities  as 
N-oo.  so  the  recursion  would  be  exact  for  densities  in  that  class.  We  can  still  use  this 

technique  for  discontinuous  densities,  but  we  cannot  guarantee  convergence  as  N 
iticreases. 

hi  the  next  chapter  we  will  look  at  a  number  of  numerical  examples  to  show  the 
typical  error  behavior  of  this  algorithm. 


4.  NUMERICAL  EXAMPLES  OF  THE  ALGORITHM 


4.1.  The  Linear  Gaussian  Case 

In  tills  chapter  we  consider  some  examples  of  the  algorithm  in  use.  The  first  is 
the  linear  Gaussian  case.  This  case  is  important  as  a  benchmark  since,  as  we  pointed 
out  earlier,  the  Kalman  filter  is  its  Bayes’  solution.  Hence  we  can  compare  the  approxi¬ 
mate  solution  provided  by  the  algorithm  with  the  exact  analytical  solution.  As  we  will 
see,  the  algorithm  performs  quite  well. 

Implementing  the  recursion  of  (3.14)  for  a  scalar  system  with  linear  dynamics  is 
very  straightforward.  To  begin,  we  will  use  a  time  invariant  system.  The  model  is 

^:*  +  l  =  Fz*  +  ui* 

2*  =  //Zj  -t-  V^ 

where  and  v*  are  zero-mean  Gaussian  with  covariances  Q  and  R  respectively.  Hence 

/*(z,2)  =  (2x7?)-  exp(— 

r(z]  ^  (2n  Q)- 

Since  the  algorithm  demands  that  ft  and  r  have  finite  support,  we  truncate  the  Gaus¬ 
sian  densities  at  plus  and  minus  three  standard  deviations.  The  approximation  grid  is 
simply  a  collection  of  intervals  on  the  real  line,  defined  by  the  two  scalars  a*  and 
With  this  information,  we  can  now  implement  the  equations  in  (3.14).  The  inverse 
image  sets  H,  (equation  (3.14a))  are  single  intervals,  given  by 
((“*ti(*~)  +  /^*+i)/^’.(®*+i(‘+)  ^^*+i)/^’)-  a  simple  matter  to  calculate  the  intervals  in 

the  old  grid  that  contain  th**  endpoints  of  each  H,,  and  from  that  identify  the  non¬ 
empty  intersections  in  equation  (3.14b).  Generally,  only  two  or  three  intervals  in  the 
old  grid  arc  involved.  The  remaining  equations  are  implemented  directly,  and  do  not 
require  further  discussion. 

As  discussed  in  the  previous  s  ■  n,  we  are  most  concerned  with  the  possibility 
that  the  accuracy  of  the  density  approximation  may  degrade  as  time  progresses.  We 
can  begin  by  visually  comparing  the  approximate  density  with  the  exact  density 
obtained  from  the  Kalman  filter.  Since  we  are  going  to  look  at  the  density  after 
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cunsiderabic  titiu-  has  passed,  we  will  use  F--\  so  that  the  syslein  has  reasonable  long- 
lerin  behavior.  Figures  4.1-3  show  ihe  cotiiparison  after  500  iterations  for  a  32  point 
grid  and  three  toinbinations  of  noise  covariances.  Visually,  at  least,  the  approximate 
density  is  a  good  repre.sentation  of  the  actual  density.  Reviewing  the  density  comparis¬ 
ons  over  the  complete  history  reveals  that  the  approximation  maintains  nearly  the  same 
degree  of  fidelity  througiioul.  'I'his  tends  to  support  the  observation  from  the  previous 
.section  that  the  error  may  reach  a  steady  stale  condition  instead  of  constantly  increas¬ 
ing. 

Visually,  the  approximation  seems  stable,  but  what  can  we  say  numerically? 
The  obvious  choice  of  error  to  calculate  is  the  ratio  error  bound  defined  in  the  previous 
chapter.  Unfortunately,  calculation  of  that  error  measure  is  complicated  by  the  pres¬ 
ence  of  two  additional  sources  of  error:  first,  the  error  introduced  by  the  finite  word 
length  FF'T,  and  second,  the  error  resulting  form  the  truncation  of  the  Gaussian  densi¬ 
ties.  Both  these  contributions  are  of  the  same  magnitude  as  the  density  itself  near  the 
tails.  This  can  result  in  arbitrarily  large  error  ratios  as  one  moves  away  from  the 
center.  Thus,  the  error  criterion  of  the  last  section  tends  to  give  inflated  results.  Res¬ 
tricting  attention  to  the  central  intervals  minimizes  these  contributions  but  is  not 
rigorously  justifiable.  Even  so,  it  can  provide  information  on  the  general  trends  of  the 
error,  so  we  will  use  the  ratio  error  restricted  to  the  interval  ±  i.ha  about  the  mean  of 
the  actual  density.  In  other  words, 

RatioError  ^  max 

where  n  and  o  are  those  of  the  actual  density.  Figures  4.4-6  show  this  measure  versus 
time  for  the  same  situations  as  above,  averaged  over  20  runs.  Note  that  the  error 
remains  more  or  less  constant  over  the  entire  run.  This  again  supports  the  proposition 
that  the  algorithm  achieves  a  steady  state  error.  Another  useful  numerical  comparison 
is  between  the  moments  of  the  approximate  density  and  those  of  the  true  density.  Fig¬ 
ures  1.4-6  also  sh«)w  the  averaged  error  in  the  first  two  central  moments.  The  error  in 
the  mean  is  given  in  percent  of  the  actual  standard  deviation.  The  error  in  the  variance 
is  in  percent  of  the  actual  variance.  Once  again,  the  errors  are  seen  to  be  reasonably 
constant. 


How  do  these  results  depend  on  the  particular  parameters  used  for  these  first 
examples?  There  are  four  different  parameters  to  consider:  the  system  dynamic 
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Approximate  vs  Actual:  LTI  System 
1.0,H=1.0,Q=6.0,R=6.0.DLIM=3.0,N=32  K=500 


Figure  4.3  -  Actual  versus  approximate  density  for  linear  time 
invariant  system  with  Gaussian  noise,  after  500  time  steps. 
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Average  trror  Jiisto 
=1.0,H=1.0,Q=2.0.R=4.0,DLIM=3.0 


Figure  4.4  -  Average  error  history  of  approximate  density  for 
linear  time  invariant  Gaussian  system. 
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Figure  4.5  -  Average  error  history  of  approximate  density  for 
linear  time  invariant  Gaussian  system. 
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coiisiciiil  F.  the  noise  variances,  the  number  of  grid  points,  and  llie  point  at  which  the 
Ciaussiaii  densities  are  truncated.  The  first  two  are  parameters  of  the  system,  and  tlie 
second  two  are  parannUers  of  tfie  approximation  algorithm.  We  will  use  the  time- 
averaged  nu'an  and  variance  of  the  numerical  error  measures  defined  above  to  look  at 
the  effects  these  parameters  have.  I'able  4.1  summarizes  tlu“  results  of  the  various  cases 
to  be  considered . 


Look  first  at  changing  the  system  constant  F.  Figures  4.7-9  show  the  resulting 
density  approximations  after  20  time  steps  for  three  different  cases:  stable,  unstable, 
and  oscillatory  systems.  We  would  expect  these  changes  to  have  little  effect  on  the  per¬ 
formance  of  the  algorithm,  and  the  figures  and  the  data  from  Table  4.1  bear  this  out. 
The  variations  in  the  error  statistics  are  very  small. 


Next  consider  different  system  noise  variances.  Figures  4.1-3  showed  three 
situations  different  only  in  the  noise  variances,  and  visually  there  was  little  difference. 
However,  looking  at  the  numerical  data  in  Table  4.1,  we  notice  a  pronounced  change  as 
the  ratio  of  system  input  noise  variance  to  observation  noise  variance  changes.  Note, 
though,  that  it  is  the  ratio  that  is  important,  not  the  individual  magnitudes.  The 


statistics  for  the  base  case  {Q  ^2,  H  =  4,  so  -^^0.5)  are  virtually  identical  to  those  for 

R 

Q  V>,  H  \2  (-^^0.5  again).  As  the  ratio  of  the  variances  gets  smaller,  the  errors 

A 


increase  in  all  categories.  This  is  because,  as  the  ratio  gets  larger,  the  data  in  the 
current  measurement  is  weighted  more  relative  to  the  data  in  the  old  prediction  density. 
1  his  effectively  reduces  the  dependence  on  the  past  data,  and  hence  reduces  the  carry¬ 
over  of  old  errors.  In  terms  of  the  densities,  the  smaller  the  ratio  becomes,  the  more 
sharply  spiked  /i  (and  hence  v)  becomes  compared  to  r.  Thus  jf,  the  result  of  the  con¬ 
volution  of  7  and  Vi  looks  increasingly  like  a  shifted  version  of  r,  with  only  the  error 
associated  with  approximation  of  r  and  no  carryover  from  past  iterations.  So  the  algo¬ 
rithm  performance  does  depend  to  .some  degree  on  the  system  parameters,  but  in  a 
predictable  and  reasonable  way. 


V\e  turn  now  to  varying  the  algorithm  parameters.  Figures  4.10-12  show  the 
density  approximations  for  8,  12,  and  IG  grid  |)oints.  As  we  would  expect,  the  fidelity 
of  the  approximation  improves  as  we  increase  the  number  of  grid  points.  Similarly,  the 
data  in  'Fable  4.1  shows  decreasing  errors  as  the  number  of  grid  points  increases. 
Interestingly,  even  with  as  few  as  8  points,  we  get  a  fair  representation  of  the  density. 
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Figure  4.12  -  Actual  versus  approximate  density  with  64  grid 

intervals. 
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Average  over  1000  time  steps 

Tal)le  4.1  -  Summary  of  Krrors  in  Approximate  Density  as  Parameters  are  Varied 


Despite  the  rather  large  ratio  errors  for  that  case,  the  moments  are  the  approximate 
density  are  not  too  far  off.  The  average  error  in  the  mean  is  less  than  1%  and  95%  of 
the  time  the  approximate  mean  is  within  .56a  of  the  actual  (based  on  a  Gaussian  error 
distribution  and  a  2a  bound).  It  is  also  worth  noting  that  we  begin  to  get  decreasing 
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returns  as  we  go  from  16  to  32  and  then  to  64  grid  points.  This  points  out  the  impor¬ 
tance  of  balancing  desire  for  numerical  accuracy  and  computational  burden.  In  practice, 
it  may  be  necessary  to  experiment  at  the  beginning  to  select  a  reasonable  number  of 
grid  points  for  the  particular  application. 

Finally,  look  at  changing  the  truncation  point  for  the  Gaussian  densities.  Fig¬ 
ure  4.13  shows  the  resulting  approximate  density  when  the  Caussians  are  truncated  at 
±  4a ,  and  Figure  4.14  shows  the  result  for  ±  2.5a.  Visually,  there  appears  to  be  little 
difference.  Numerically,  however,  we  note  a  degradation  for  both  cases  in  comparison  to 
the  base  case.  As  we  increase  the  truncation  point,  we  include  only  tiny  additional  pro¬ 
bability  mass,  but  force  the  grid  to  cover  a  larger  interval  with  the  same  number  of 
points.  As  a  result,  the  densities  appear  more  sharply  peaked,  and  so,  as  predicted  by 
the  error  analysis  of  the  previous  chapter,  the  ratio  error  increases,  as  well  as  the  errors 
in  the  mean  and  variance.  On  the  other  hand,  reducing  the  truncation  point  discards 
an  increeisingly  significant  amount  of  mass,  even  though  it  improves  the  grid  coverage  in 
the  central  region  of  the  density.  This  causes  the  observed  reduction  in  ratio  error  with 
increased  variability  of  the  moment  errors.  The  increased  variance  of  the  moment 
errors  for  the  2.50  cutoff  also  occurs  because  reducing  the  truncation  point  aggravates  a 
problem  that  any  truncation  of  a  infinite  density  can  generate.  As  you  can  see  in  Fig¬ 
ure  4.14,  the  approximate  density  for  this  case  has  a  fairly  obvious  skew  to  the  right 
relative  to  the  actual  density.  This  is  caused  by  a  noise  sample  near  the  truncation 
point.  The  multiplication  for  the  measurement  update  will  now  emphasize  the  region 
near  the  truncation,  resulting  in  a  distortion  on  one  side  of  the  density.  The  more 
severely  the  density  is  truncated,  the  worse  the  problem  will  be.  Care  must  be  taken  in 
selecting  the  truncation  point  for  a  particular  application,  and,  as  with  selection  of  the 
number  of  grid  points,  some  experimentation  may  be  required. 

Two  more  features  of  the  algorithm  are  worth  noting.  First,  the  error  in  the 
variance  appears  to  have  a  definite  positive  bias.  In  fact,  from  looking  at  the  average  of 
the  absolute  value  of  the  error,  it  seems  that  the  variance  of  the  approximate  density  is 
almost  always  greater  than  the  actual  variance  for  most  cases.  In  other  words,  the 
approximate  density  almost  always  overestimates  the  variance.  This  is  intuitively  rea¬ 
sonable  when  you  consider  the  form  of  the  approximation.  Over  any  interval,  the 
approximation  is  usually  greater  than  the  actual  density  on  the  side  away  from  the 
mean,  and  smaller  on  the  side  closer  to  the  mean.  Hence  the  approximation  will  show  a 
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greater  spread  than  the  actual.  For  the  mean,  however,  the  differences  will  tend  to  bal¬ 
ance  each  other,  so  we  do  not  see  a  bias  in  the  mean  of  the  approximate  density.  This 
is  a  very  nice  property,  since  it  says  that  the  algorithm  generates  a  consistently  conser¬ 
vative  approximation  to  the  actual  density.  The  two  exceptions  to  this  general  observa¬ 
tion  are  the  cases  for  N  =  64  and  truncation  at  i  2.5(7.  In  both  cases  the  effect  of  lost 
mass  beyond  the  truncation  point  cancels  the  effect  described  above. 

Second,  you  may  have  noted  from  the  graphs  that  the  interval  which  is  the 
peak  of  the  approximate  density  often  contains  the  peak  of  the  actual  density.  In  the 
runs  made  to  generate  the  data  in  Table  4.1,  this  occurred  about  95  percent  of  the  time, 
consistently  for  all  cases.  This  effect  is  related  to  the  interpretation  of  the  algorithm  as 
a  probability  mass  Filter.  The  algorithm  almost  always  assigns  greatest  mass  to  the 
appropriate  interval. 

To  close  this  section,  let’s  now  relax  the  time-invariant  restriction.  Figure  4.15 
shows  the  density  comparison  for  a  system  with  Gaussian  noises  with  time-varying  vari¬ 
ances  given  by 

Qk  -  <?o  +  M^)  ;  Hk  -  Ho  +  sin(^) 

As  you  can  see,  the  algorithm  handles  this  case  without  any  difficulty.  Numerically,  the 
errors  are  comparable  to  the  base  time-invariant  case. 


4.2.  Sign-only  Observations 

We  have  seen  in  the  last  section  that  the  approximate  algorithm  generates  a 
well-behaved,  stable  approximation  to  the  actual  prediction  density.  There  is  little  use 
in  continued  application  of  the  algorithm  to  the  linear  Gaussian  case  of  the  last  section 
though,  because  we  already  have  an  exact  solution  for  that  case.  More  interesting  is  to 
look  at  nonlinear  non-Gaussian  systems,  and  use  the  algorithm  to  explore  their 
behavior.  For  example,  having  a  representation  of  the  complete  posterior  density  allows 
us  to  actually  calculate  the  optimal  estimate  for  any  given  cost  function.  Thus  we  can 
compare  the  performance  of  any  approximate  point  estimator  to  the  theoretical 
optimum.  Alternatively,  w«'  can  use  the  algorithm  to  look  at  the  effects  of  different  sys¬ 
tem  structures  on  the  information  in  the  observations.  Many  applications  are  possible. 

As  a  simple  illustrative  example,  consider  a  system  with  linear  dynamics  and 
Gaussian  input  noise,  but  where  we  observe  only  the  sign  of  the  output.  The  system  is 
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described  by 


J:*>i  Fz^  -+  u-i 
2*  -  sgn(it  -t  ut) 

where  sgn(r)  is  the  sign  function  defined  by 
1 1  z  0 

It  is  interesting  to  note  that  although  this  is  a  simple  nonlinearity,  there  is  no  EKF  for 
this  system,  since  we  cannot  linearize  the  sgn  function. 

It  is  a  simple  modification  to  apply  the  algorithm  to  the  above  system,  since  the 
algorithm  is  nicely  modular.  The  only  change  in  the  system  is  in  the  observation  equa¬ 
tion,  so  the  only  change  in  the  algorithm  is  in  the  measurement  density  fi.  For  this  case 
we  have 

where  6  is  the  Dirac  delta  or  ‘impulse’  function,  and  e,  and  c_i  are  given  by 

00  2 

e,(2)  =  Prob(z  +  v5  ‘^)  "  /  (2»r*)  *exp(-^)  dy 

**  * 

e  i(i)  -  Prob(z+v<0)  =  J  (2ji  R)  ^  l~«i(*) 

“OO 

Other  than  this,  the  algorithm  is  implemented  exactly  as  it  was  for  the  first  section. 
Also  as  in  the  first  section,  we  will  use  a  32-interval  grid,  and  truncate  the  Gaussians  at 
±  3ff . 

One  thing  that  would  be  nice  to  know  for  this  system  is  how  much  we  lose  only 
having  the  sign  for  the  observation.  Or  conversely,  how  much  could  we  gain  if  we  had 
full  observations  (sign  plus  magnitude).  I'o  at  least  start  to  answer  this  question,  we 
can  compare  the  approximate  prediction  density  for  the  sign-only  system  to  the  output 
from  a  Kalman  filter  operating  on  the  full  measurements.  After  a  little  experimenting 
with  the  system  it  becomes  apparent  that  there  are  two  basic  regimes  to  consider. 
First,  where  the  state  remains  in  the  vicinity  of  zero,  and  second,  where  it  does  not.  In 
the  first  case,  we  get  reasonably  frequent  switches  in  the  sign  of  the  output.  As  might 
be  expected,  this  leads  to  a  roughly  Gaussian  shaped  density.  Figure  4.16  shows  such  a 
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silualiun.  Note  lhal  the  means  of  the  ivvo  densities  are  tjuile  close.  Tlie  density  for  the 
sign  only  case,  however,  has  a  noticeably  larger  variance,  indicating  the  reduced  infor¬ 
mation  content  of  the  sign-oidy  observations.  In  the  second  mse,  where  the  state  drifts 
away  from  zero,  the  observations  are  consistently  the  same  sign.  Now  we  know  only 
which  side  of  zero  the  state  is  on,  but  not  how  far  away  it  is.  We  would  ex|)cct,  then, 
that  the  density  would  begin  to  smear  out,  displaying  a  large  degree  of  uncertainty 
about  the  state.  Figure  4.17  shows  the  two  densities  for  this  case.  Again,  the  means  of 
the  two  densities  are  not  grossly  different,  but  the  variance  of  the  sign-only  density  is 
considerably  larger.  The  sign-only  density  is  also  clearly  skewed  away  from  zero,  show¬ 
ing  the  larger  degree  of  uncertainty  in  that  direction. 

Looking  at  the  time  history  of  the  density  gives  additional  information.  For 
simplicity,  we  will  just  consider  the  mean  of  the  two  densities,  since  these  provide  a  rea¬ 
sonable  measure  of  the  relative  locations  of  the  densities.  Figure  4.18  shows  this  for  the 
same  system  parameters  as  in  Figure  4.16.  Note  the  extreme  response  of  the  sign-only 
density  to  changes  in  the  sign  of  the  observation.  As  noted  before,  this  is  because  the 
data  does  not  contain  information  on  how  far  the  state  has  moved.  The  latter  half  of 
the  graph  shows  the  effect  of  the  sign-only  density  spreading  out  in  the  positve  direc¬ 
tion. 

At  this  point  we  can  make  some  general  observations.  First,  even  at  best,  any 
state  estimate  based  otdy  on  the  sign  of  the  output  will  have  poor  performance  com¬ 
pared  to  having  full  observations.  Second,  a  sign-only  estimator  will  perform  best  when 
the  state  is  near  zero  so  that  there  frequent  changes  in  the  sign  of  the  data.  These  are 
hardly  surprisitig,  and  might  even  be  intuitively  obvious,  but  the  above  is  the  only  rea¬ 
sonable  way  to  actually  demonstrate  them. 

We  could  continue  by  extracting  numerical  data,  and  quantifying  the  behavior 
of  this  system,  but  since  this  example  is  only  intended  to  demonstrate  the  application  of 
the  approximate  density  algorithm,  we  will  not.  It  is  sufficient  to  point  out  that  ana¬ 
lyses  of  this  type  can  only  be  accomplished  by  calculating  the  complete  posterior  den¬ 
sity,  and  that  this  algorithm  provides  a  powerful  means  of  doing  so. 


5.  APPLICATION  TO  PARAMETER  IDENTIFICATION 


5.1.  Introduction 

Well,  what  good  does  this  algorithm  do  us?  The  answer  to  that  is  highly  prob¬ 
lem  dependent.  Clearly,  for  many  combinations  of  systems  and  objective  functions,  the 
traditional  point  estimators  provide  an  acceptable  answer.  Furthermore,  since  this  tech¬ 
nique  can  be  fairly  expensive  computationally,  even  a  poor  point  estimator  may  be 
better  for  some  applications.  Rather  than  futilely  try  to  find  gross  generalizations, 
probably  the  best  approach  is  to  identify  particular  applications,  and  explore  them 
using  this  technique. 

We  can  describe,  however,  several  general  ways  that  the  algorithm  can  be  of  use 
to  us.  First,  simply  working  with  the  equations  and  thinking  of  the  process  in  terms  of 
operations  on  densities  provides  a  deeper  understanding  of  the  mechanics  of  the  process. 
For  more  detail,  we  can  use  the  close  approximation  of  the  posterior  density  that  the 
algorithm  generates  to  plot  the  prediction  density  or  develop  other  useful  descriptions  of 
it.  This  can  provide  additional  insight  into  the  behavior  of  the  system,  and  would  be 
useful  in  much  the  same  way  as  traditional  simulation  is  for  deterministic  systems. 
Furthermore,  as  is  done  with  deterministic  simulation,  we  can  use  the  technique  to 
examine  how  the  behavior  of  the  system  varies  as  we  change  features  of  the  system. 
Finally,  again  since  we  have  a  representation  of  the  posterior  density,  we  can  actually 
calculate  the  optimal  estimator  for  any  given  loss  function.  Thus  we  can  compare  the 
performance  of  traditional  point  estimators  directly  to  that  of  the  optimal.  In  this  way 
we  can  Judge,  for  instance,  whether  apparently  poor  performance  is  a  fault  in  the  esti¬ 
mator,  or  is  actually  just  the  best  that  one  can  expect.  This  could  be  particularly  useful 
in  assessing  transient  performance,  where  there  are  virtually  no  theoretical  results.  To 
consider  these  uses  in  a  concrete  situation,  we  will  look  at  them  in  the  context  of  the 
problem  of  simultaneous  state  and  parameter  estimation.  We  will  look  first  at  using 
this  approach  to  analyze  the  propagation  of  the  prediction  density,  and  then  we  will  use 
it  to  evaluate  the  performance  of  a  point  estimator. 

What  is  the  parameter  estimation  (or  system  identification)  problem?  It  is 
where  we  know  (or  assume  we  know)  the  system  model  except  for  a  finite  number  of 
parameters.  The  object  is  to  be  able  to  estimate  both  the  system  state  and  the  system 


parameters  using  the  observations.  Generally,  the  system  can  be  nonlinear,  and  the 
parameters  time  varying.  In  practice,  however,  virtually  all  the  work  has  been  done  for 
linear  systems  with  time-invariant  parameters.  For  that  case,  it  has  been  possible  to 
take  advantage  of  the  linear  structure  of  the  system  to  obtain  reasonably  effective  point 
estimalors.  A  large  body  of  work  exists  concerning  these  estimators,  which  we  will  not 
try  to  review  here.  These  approaches  generally  arise  from  considering  the  states  and 
parameters  as  distinctly  different  entities.  On  reflection,  though,  there  is  actually  little 
to  distinguish  the  unknown  parameters  from  the  unknown  states  of  the  system.  When 
we  recall  the  Bayesian  definition  of  randomness,  we  see  there  is  really  no  distinction  at 
all.  From  the  Bayesian  viewpoint,  the  unknown  parameters  are  simply  additional  sys¬ 
tem  states  in  a  nonlinear  system  model.  As  Peterka  [3]  and  Ljung  [38]  have  pointed 
out,  this  is  the  only  logically  consistent  approach.  With  this  view  of  the  problem,  it  is 
.straightforward  (although  possibly  algebraically  involved)  to  apply  the  algorithm  to  the 
parameter  identification  problem. 


5.2.  Applying  the  Algorithm 

For  our  first  analysis,  we  will  consider  a  scalar  linear  identification  problem, 
where  we  have  a  model  with  linear  dynamics  with  an  uncertain  transition  parameter. 
The  model  is 

**  +  i  =  Fi*  +  u*  +  w* 

z*  +  V* 

where  the  transition  parameter  F  is  unknown,  and  where  is  a  known  input  sequence 
and  wj  and  v*  are  uncorrelated,  white,  zero-mean,  Gaussian  noise  sequences  with  vari¬ 
ances  Q  and  R  respectively.  The  equivalent  nonlinear  vector  system  that  we  will  use  is 
given  by* 

=  +  [i] 

=  *(3),*  +  V*  =  /1(Z4  ,  V*) 

where  u^,  and  are  as  above.  We  are  now  in  a  position  to  apply  the  algorithm 
1.  Recall  that  the  notation  Z(,)  refers  to  the  ith  component  of  the  vector  i. 
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developed  in  this  paper. 

First,  we  must  deHne  the  grid.  For  simplicity,  we  will  use  a  rectangular  grid, 
defined  by  the  cells 


*(J)€  |a(2),t{j(2)- ^(*).*  >  “(2).*(.>(2)  +  ~)-t  fc(2),*l} 

SO  in  the  definition  given  in  chapter  3,  the  matrix  A  will  be  diagonal.  The  volume  of 
each  cell  is  therefore  given  by  Q*  =  a(,)  *0(,)  *.  At  each  time  step,  we  will  define  the  new 
grid  based  on  the  smallest  rectangle  containing  the  support  of  the  new  prediction  den¬ 
sity.  Note  that  since  there  is  no  noise  on  x^,)  and  no  direct  observation  of  that  state, 
the  support  of  the  density  along  that  dimension  will  not  change.  Hence,  it  will  remain 
the  same  as  the  support  of  the  initial  density,  and  the  grid  parameters  in  that  dimen¬ 
sion  will  be  constant. 

Now  consider  each  of  the  equations  of  the  recursion  (3.14)  in  turn.  First  is 
equation  (3.14a),  which  is  the  computation  of  the  inverse  image  of  a  grid  cell.  Since  the 
mapping  given  by  /  above  is  one-to-one,  the  inverse  image  is  a  closed  figure  like  the  ori¬ 
ginal  cell,  bounded  by  the  inverse  images  of  the  sides  of  the  original  cell.  The  inverse 
image  set  is  given  by 
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Figure  5.1  shows  a  typical  H,  in  relation  to  the  grid  at  the  previous  time  step. 
Next  look  at  equation  (3.14b).  Recalling  that  fi=p(z^  \  Zi),  we  get 


"*(2))* 

M(z,z,)  ^  (2zH)  ) 


Now  consider  the  limits  on  the  integral  of  p.  Referring  to  figure  5.1,  it  is  clear  that 
since  the  grid  in  the  Z(|)  direction  does  not  change  over  time,  we  have  H^n  1*>=0  if 
>(!)/  '(I)-  if  i(i)=‘(i).  we  have 

^  {*(!)€  l<»(i),»+i(y(i)-^)+6(i),k+i ,  <»(i),*+i(j(i)+y)+*(i).*+»!  ; 


1 


The  inner  integral  above  can  be  rewritten  as 
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where  G{x,R)  is  the  cumulative  distribution  function  for  a  Gaussian  with  mean  zero 
and  variance  H.  We  now  have 
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Now  consider  equation  (3.14c).  The  system  time  update  density  r  is  given  by 


1  _  2 

*(2) 


r*(z)  =  rf(z(,))  (2x<?)  *exp( — —) 

where  6{z)  is  the  Dirac  delta  or  ‘impulse’  function.  Hence  we  obtain 
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where  S,  is  the  Kroneker  delta.  As  with  the  equation  for  \i>,  the  above  can  be  rewritten 
in  simpler  form  in  terms  of  the  cumulative  distribution  function.  The  equations  are  now 
detailed  enough  to  implement  in  a  computer  program. 


5.3.  Analysis  of  Density  Behavior 

The  first  step  in  analyzing  the  density  behavior  is  to  develop  the  specific  reali¬ 
zations  of  the  equations  in  the  recursion  for  the  problem  at  hand,  as  we  did  in  the  last 
section.  In  doing  this,  we  are  forced  to  begin  thinking  of  the  process  in  terms  of  opera¬ 
tions  on  entire  densities,  instead  of  on  scalars,  as  we  might  in  other  schemes.  As  we  do, 
we  often  gain  a  deeper  understanding  of  the  mechanics  of  the  density  propagation  just 
by  visualizing  the  general  behavior.  In  doing  so,  it  is  useful  to  think  in  terms  of  three 
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basic  operalions:  first,  the  rnultiplicaliun  of  the  old  prediction  density  and  the  measure¬ 
ment  update  that  yields  the  filtering  density  second,  the  nonlinear  transformation 
due  to  the  system  dynamics  that  gives  the  intermediate  density  ip\  and  third,  the  linear 
convolution  of  tj)  and  the  input  noise  density  that  gives  the  new  prediction  density.  The 
first  operation  tends  to  trim  the  density,  the  second  skews  it,  and  the  third  smears  it 
out  again.  Let’s  look  at  these  operations  in  the  context  of  this  problem.  The  first  thing 
to  note  is  that  the  measurement  density  is  independent  of  the  parameter  Z(|).  In  other 
words,  the  density  looks  like  a  ridge  running  parallel  to  the  Z|i)  axis.  For  convenience, 
imagine  that  the  prior  density  is  completely  uniform  over  the  grid  area.  The  result  of 
the  multiplication  operation  then  would  be  a  duplication  of  the  measurement  density,  a 
ridge  parallel  to  the  zp)  axis.  The  second  operation  rearranges  the  density  according  to 
the  system  dynamics.  In  this  case,  we  map  the  density  at  a  point  (a, 6)  to  the  point 
(a,a6).  This  results  in  the  ridge  line  of  the  density  running  along  a  straight  line  through 
(0,0).  The  spread  of  the  density  in  the  zp)  direction  is  also  affected,  going  to  zero 
toward  Z|,)=0  and  increasing  with  increasing  zp).  Next,  the  convolution  spreads  the 
density  out  along  the  zp|  direction.  The  result  is  a  somewhat  wedge-shaped  ridge  run¬ 
ning  at  an  angle  to  both  axes.  Now  when  we  process  the  next  measurement,  we  again 
multiply  by  a  ridge  parallel  to  the  zp)  axis.  It  is  easy  to  visualize  this  and  see  that  the 
result  is  a  hump  shaped  density  roughly  at  the  intersection  of  the  two  original  ridges. 
As  the  process  is  repeated,  we  see  that  the  density  will  converge  in  both  dimensions. 

At  this  point,  it  is  worth  digressing  to  discuss  what  we  mean  by  convergence  of 
the  density.  As  we  discussed  earlier,  the  density  function  is  interpreted  as  a  measure  of 
the  information  that  we  have  about  some  st^te  of  nature.  The  more  spread  out  the 
density,  the  less  precise  our  knowledge.  Conversely,  the  more  sharply  peaked  the  den¬ 
sity  is,  the  more  sure  we  are.  Hence  we  are  interested  not  only  in  the  location  of  the 
density  but  also  in  its  shape;  we  want  it  to  become  increasingly  peaked  as  we  process 
measurements. 

This  type  of  intuitive  analysis  can  provide  new  insights.  For  instance,  it  makes 
it  clear  that  the  crucial  operation  is  the  twisting  of  the  density  due  to  the  dynamics 
update.  Hence  the  effectiveness  of  an  estimator  rests  on  its  ability  to  account  for  this 
effect.  As  another  example,  it  points  out  the  desirability  of  having  persistently  exciting 
inputs.  Returning  to  the  simplified  visualization  of  multiplying  two  ridge  shaped  densi¬ 
ties,  it  is  clear  that  convergence  in  zp)  depends  on  the  angle  between  the  two  ridge  lines. 
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We  want  to  avoid  the  two  being  nearly  parallel.  The  angle  is  determined  primarily  by 
the  rotation  from  the  dynamics  update,  and  the  degree  of  rotation  is  in  turn  determined 
by  how  far  away  from  zero  the  original  ridge  was.  Therefore,  we  want  to  avoid  operat¬ 
ing  the  system  with  small  state  values.  To  do  so  requires  persistent  inputs,  particularly 
for  stable  systems.  Finally,  we  can  assess  the  impact  of  the  true  value  of  the  transition 
parameter  i(,j  on  the  density  behavior.  Based  on  the  last  point,  we  might  think  that  a 
larger  true  Z(ij  would  be  easier  to  identify  since  it  leads  to  larger  state  values.  But  note 
that  the  spreading  of  the  density  in  the  dynamics  update  increases  with  zp),  and  this  in 
turn  will  degrade  the  convergence  of  the  density  in  Z(i).  It  is  not  clear  which  effect  is 
dominant,  although  1  suspect  that  they  roughly  balance  each  other.  In  this  instance,  we 
would  need  to  go  further  than  the  intuitive  analysis  for  a  definitive  answer.  Note, 
though,  that  whether  the  plant  is  stable  or  unstable  is  not  an  issue;  here  there  is  noth¬ 
ing  magic  about  the  stability  boundary. 

Having  developed  an  intuitive  feel  for  the  propagation  of  the  prediction  density, 
let  us  now  look  at  the  density.  Actually  plotting  the  density  is  probably  the  best  way 
to  visualize  it,  but  may  not  be  really  practical  for  greater  than  two  dimensions.  In  some 
cases,  it  may  be  reasonable  to  plot  two  dimensional  slices  of  the  density,  or  two  dimen¬ 
sional  marginal  densities,  which  can  be  done  using  conventional  plotting.  Alternatively, 
information  such  as  the  first  few  moments,  or  location  of  the  peak  or  peaks  can  provide 
a  good  feel  for  the  shape  of  the  density  in  multiple  dimensions.  The  entire  problem  of 
visualizing  rnulti-dimensional  surfaces  is  little  explored,  and  would  be  worth  pursuing. 

Fur  the  case  at  hand,  there  is  no  trouble  plotting  the  density.  Figure  5.2a 
shows  the  prediction  density^  for  a  low  noise  case,  Q  =  R=0.1.  The  true  parameter  value 
was  1,  and  the  actual  initial  state  was  2.  The  initial  density  was  Gaussian  centered  at 
(2,2).  It  is  obvious  that  the  ridge  line  lies  along  a  line  as  the  earlier  analysis  predicted. 
Figure  5.2b  shows  the  density  at  the  next  lime  step.  As  you  would  expect  in  a  low 
noise  case,  there  has  been  tremendous  convergence.  The  convergence  is  also  helped  by 
the  large  initial  state  value,  which  resulted  in  a  relatively  large  amount  of  rotation  on 
the  first  iteration.  The  rotation  of  the  ridge  line  is  still  discernible  in  this  plot. 

2.  Due  to  the  graphics  software  used  to  produce  the  plots  in  this  chapter,  the  densities  are 
displayed  as  continuous  functions  rather  than  the  piecewise  constant  approximations  they  really 
are.  Keep  in  mind  the  actual  functional  form. 


Parameter/State  Estimation  -  Standard  Case 
K=  1  P=l.  Q=R=.01,  S=0.  N=24 


Parameter/State  Estimation  -  Standard  Case 
K=  2  P=l,  Q=R=.01,  S=0.  N=24 


Figure  5.3  sliows  the  prediction  density  for  another  case.  Here  a  large  measure¬ 
ment  noise  variance,  was  chosen  to  get  relatively  slow  convergence  of  the  density. 

For  this  case,  the  parameter  was  1,  the  initial  state  -0.5,  and  the  initial  density  Gaus¬ 
sian  centered  at  (1,0).  Figure  5.3a  is  a  mesh  plot  showing  the  density  as  a  three  dimen¬ 
sional  surface.  Although  it  is  not  as  obvious  as  in  the  last  plot,  it  is  still  apparent  that 
the  ridge  line  runs  at  an  angle  as  expected.  The  shallow  angle  is  to  be  expected  due  to 
the  small  initial  state.  Figure  5.3b  shows  a  contour  plot  of  the  log  of  the  same  density. 
In  this  plot  the  increasing  spread  of  the  density  as  increases  is  clear.  Figure  5.4 

shows  the  prediction  density  for  the  same  run  two  time  steps  later.  Despite  the  high 
noise,  there  has  been  considerable  convergence.  The  ridge  line  skew  is  less  noticeable  on 
the  mesh  plot  of  5.4a  than  the  last  time.  Looking  at  the  contour  plot  of  5.4b,  though, 
we  see  that  the  ridge  line  has  developed  a  definite  curve.  This  is  easily  explained  by 
thinking  again  of  the  intuitive  analysis.  Since  the  meeisurement  variance  is  large,  and 
the  input  variance  small,  for  a  first  approximation  we  can  neglect  the  first  and  third 
steps,  and  the  influence  of  the  known  input.  Now  start  with  a  ridge  running  parallel  to 
the  Z(,)  axis.  After  the  first  dynamics  update  the  ridge  follows  a  straight  line  through 
the  origin.  After  the  second,  it  follows  a  parabolic  arc,  since  we  mapped  (a,b)  into 
(tt,tt6)  then  into  (a,a*6).  After  the  third,  it  will  follow  a  cubic.  Including  the  neglected 
effects  disrupts  this  nice  progression,  but  the  basic  mechanism  is  there. 

Any  number  of  parametric  studies  are  possible  at  this  point.  It  is  possible  to 
look  at  the  effects  of  varying  the  noise  variances,  the  initial  state  and  parameter  values, 
or  the  initial  state  and  parameter  estimates.  We  can  explore  different  noise  and  initial 
state  densities,  such  as  uniform  instead  of  Gaussian,  and  look  at  their  effects.  The  pos¬ 
sibilities  are  virtually  limitless.  They  are,  however,  not  very  meaningful  without  a 
specific  application  in  mind.  Therefore,  we  will  not  pursue  this,  and  instead  encourage 
users  to  apply  the  technique  to  their  particular  problem. 

5.4.  Analysis  for  Other  Unknown  Parameters 

Ilavitig  gone  through  the  analysis  above  for  the  case  of  an  unknown  transition 
parameter,  it  is  quite  easy  to  extend  the  intuitive  analysis  to  other  situations.  In  this 
section  we  will  consider  four  other  possible  unknown  parameters,  and  do  a  brief  parallel 
analysis  of  the  estimation  problem  for  each. 


Figure  5.3  -  Parameter/state  prediction  density  for  Q=0.01 

R  =  l,  k=l. 
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First,  consider  the  case  of  an  unknown  scaling  on  the  input.  I'his  is  modeled  by 
I* ,  1  ^  Fz^  i  Eut  -+  uij 
2*  -  I*  e* 

where  the  parameter  E  is  unknown.  The  nonlinear  stale  model  is 

-  +  [ij  ‘i'i 

=  *(!),*  +  e*  ^  A(x*  ,  u*) 

The  grid  will  be  defined  as  in  the  previous  section.  The  inverse  image  of  a  grid  cell  is 
now  given  by 

=  {*(!)£  I“(i),*+i(j(i)-— j-tfciij.t+i .  “(i),*+i(j(i)+— )  +  <>(i),*+il ; 

•(.,£  I - ^7 - . - - 1} 

SO  the  inverse  image  is  a  parallelogram.  This  obviously  modifies  the  definitions  of  the 
limit  functions  <  and  ^  in  (5.1)  and  (5.2)  but  does  not  otherwise  affect  the  equations. 

The  density  behavior  for  this  case  is  somewhat  similar  to  that  for  the  first  case. 
Again  imagine  a  uniform  prior.  As  before,  the  measurement  density  is  a  straight  ridge 
parallel  to  the  iji)  axis  and  loco.ted  at  ijjpr,  and  it  is  reproduced  by  the  first  measure¬ 
ment  update.  The  system  dynamics  map  the  point  (o,6)  into  (a,Fa  +  bu),  so  the  dynam¬ 
ics  update  results  in  a  ridge  running  along  the  line  Z(2)=F’2-ti(,)U.  Note  that  the  spread 
of  the  density  in  the  zjj)  direction  does  not  depend  on  Z(i).  The  convolution  with  the 
input  noise  density  spreads  the  density  out  in  the  Z(2)  direction,  resulting  in  a  straight 
ridge  with  its  Z(2)  spread  independent  of  Z(,).  Note  the  contrast  with  the  last  case,  where 
the  Z(2)  spread  increases  with  Z(,j.  Processing  the  next  measurement  results  in  hump  at 
the  intersection  of  the  two  densities.  As  before,  the  degree  of  convergence  depends  on 
the  angle  between  the  two  ridge  lines,  but  in  this  case,  the  angle  depends  directly  on  the 
input,  not  on  the  state.  Zero  input  results  in  no  convergence  along  Z(|j,  as  would  be 
expected.  In  the  previous  case  we  wanted  to  have  inputs  to  keep  the  system  state  from 
getting  small;  here  we  must  have  inputs  to  get  any  convergence  at  all.  Also  note  that 
the  spread  of  the  densities  does  not  depend  on  Z(,),  so  there  is  no  variation  in  conver¬ 
gence  with  true  value  of  the  parameter. 


-1 

*(2).*  +  l 

Fl(2).t  +  *(!),*“* 

1 
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Next  consider  an  unknown  scaling  in  ilie  output  equation.  The  model  is 

2*  -  i/z*  ^  t* 

where  the  parameter  H  is  unknown.  The  nonlinear  state  model  is 


riM.m 


[*(2),*  +  lJ  [^^(2),*  “•  “*J 

•2*  ^  •1(1), i  ^(2),*  +  ^  A(rt  ,  e*) 


/(!*>“*)  +  1 


Note  that  this  model  has  linear  dynamics.  Using  the  same  grid  as  before,  a  simple  com¬ 
putation  yields  the  inverse  image  set 

“(2).*-fl(j(2)-  — )  +  i(2),*-H-“t  0(2),*-n(y(2)  +  -^)+l>(2),ifI-“t 

I - 1 -  . - 1: - 1} 

which  is  just  a  rectangle.  Again,  this  leads  to  modiHcation  of  the  limit  functions  in 
(5.1).  More  importantly,  we  must  also  change  n  because  of  the  change  in  the  measure¬ 
ment  equation.  Equation  (5.1)  becomes 

1  'j 

~  at., 


r  dz,„  f  dz(,) 


The  density  behavior  for  this  case  is  quite  different  than  the  first  two.  Again, 
we  will  start  with  a  uniform  prior.  The  measurement  density  defined  above  is  a  some¬ 
what  banana-shaped  ridge,  with  the  ridge  running  along  the  hyperbola  Z(j)=z/r(,p  The 
isocontours  follow  hyperboleis  in  all  four  quadrants.  The  dynamics  update  scales  the 
density  and  shifts  it  in  the  ijjj  direction,  and  then  the  convolution  with  the  input  noise 
density  spreads  it  out  in  the  ijj)  direction  independent  of  Z(,).  The  result  is  essentially  a 
shifted  version  of  the  measurement  density.  The  next  measurement  update  can  now  be 
roughly  visualized  as  the  multiplication  of  two  of  these  densities  offset  with  respect  to 
each  other.  If  there  is  little  offset,  there  will  be  relatively  little  convergence.  On  the 
other  hand,  if  there  is  a  large  offset,  the  isocontours  of  the  densities  will  cross  at  large 
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angles,  resulting  in  convergence  in  both  dimensions.  In  contrast  to  the  first  two  cases,  it 
is  the  nonlinear  nature  of  the  measurement  update  that  is  important,  rather  than  the 
dynamics.  A  few  moments  sketching  the  densities  for  different  situations  shows  that 
the  best  situation  is  one  with  large  oscillating  known  inputs.  At  the  extreme,  if  i/  =  0, 
then  there  is  no  convergence  in  X(,).  There  is  also  a  fairly  strong  dependence  on  the  true 
value  of  the  parameter.  Larger  true  ij,)  tends  to  amplify  the  effect  of  the  inputs,  and  so 
the  density  will  converge  faster. 

Now  look  at  unknown  scaling  on  the  measurement  noise.  The  model  for  this  is 

Xj  ,  1  ^  /’ij  -f  Uj  -t  u/* 

X*  =  -t 

where  the  parameter  D  is  unknown.  The  nonlinear  state  model  is 


^(1).*  .1 


[^(2).*  +  l]  (^"^(2).*  +  “*J 


^  /(!*>“*)  llj 


This  model  has  the  same  dynamics  as  the  last,  so  the  grid  and  inverse  image  set  are  the 
same.  The  measurement  noise  density  fi  is  changed,  however,  making  (5.1) 


lh , 


/  ‘^^(1)  /  *^^(2)  (2”^)  *exp(- 

fiim) 


U*~i(2))’ 


The  measurement  density  for  this  case  is  a  fan-shaped  ridge,  with  the  apex  at 
x  -(0,2),  and  the  ridge  line  running  along  i{2)-  2-  Starting  with  the  ubiquitous  uniform 
prior,  we  get  a  copy  of  this  density  after  the  first  measurement  update.  The  system 
dynamics  map  (a,b)  into  [a,Fb  tu),  resulting  in  a  scaling  and  shift  in  the  x^j  direction. 
'I’he  convolution  then  spreads  the  density  out  in  the  i(2)  direction.  The  resulting  density 
is  still  fan-shaped.  Now  when  the  next  measurement  is  processed,  we  multiply  by 
anothc-r  fan-shaped  ridge  with  a  parallel  ridge?  line.  The  result  is  yet  another  fan-shaped 
ridge,  with  no  convergence  in  the  X(,)  direction.  Jo  take  an  extreme  example,  suppose 
that  the  true  value  of  Xj,)  is  0.  Then  at  each  measurement  update  the  ridge  lines  of  the 
densities  would  line  up.  It  is  easy  to  see  that  the  density  will  not  converge  in  the  xpj 
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direction.  For  non-zero  ij,),  the  ridge  lines  will  be  offset.  This  will  tend  tt)  reduce  the 
density  for  smaller  values  of  i(|),  but  will  not  affect  the  higher  values.  In  this  case,  the 
parameter  is  not  identifiable,  and  this  is  easily  shown  by  considering  the  actual  density 
behavior  as  we  have  done. 

5.5.  Comparisons  of  Point  and  Bayes’  Estimates 

Let’s  now  look  at  the  final  use  mentioned  in  Section  5.1.  We  will  use  for  our 
candidate  point  estimator  a  minimum  variance  estimator  derived  by  Liang  and 
Christensen  |39j.  Liang  has  published  the  results  of  several  simulation  studies  compar¬ 
ing  the  filter  (referred  to  hereafter  as  the  MVF)  to  the  EKF  for  a  parameter  identifica¬ 
tion  problem  |40j.  In  his  paper  Liang  points  out  the  theoretical  difficulties  of  comparing 
suboptimal  nonlinear  filters,  and  concludes  that  only  detailed  numerical  simulations  can 
provide  the  basis  for  meaningful  comparisons.  This  is  undoubtedly  true,  but  Liang  mis- 
f,akenly  discounts  the  possibility  of  comparing  the  suboptimal  filters  to  the  true  optimal 
solution.  The  true  optimal  solution  is  available  for  this  case  as  the  mean  of  the  condi¬ 
tional  density  computed  via  the  algorithm  derived  in  this  paper  (for  simplicity,  we  will 
refer  to  the  conditional  mean  for  this  case  as  the  Bayes’  estimate).  What  we  will  do, 
then,  is  extend  the  analysis  of  [40]  to  include  the  true  optimum. 

In  his  paper,  Liang  explores  the  behavior  of  the  MVF  for  three  slightly  different 
discrete-time  system  models.  For  this  work  we  will  concentrate  on  the  second  of  the 
systems,  which  is  described  by  the  model 


*(«).* 

+ 

0 

=  /(**)  +  [il 

*(!),* 

+  V*  -  /l(zt  ,  v*) 

Note  that  the  unknown  parameter  r(,|  now  appears  as  a  scaling  factor  in  the  observa¬ 
tion  equation  as  well  as  in  the  dynamics,  and  the  model  does  not  include  a  known  input. 
This  model  does  not  seem  to  have  a  clear  physical  interpretation,  and  may  have  been 
chosen  simply  because  the  EKF  does  not  handle  it  well. 

The  equations  developed  in  Section  5.2  can  be  applied  almost  directly  to  this 
model.  The  mam  change  is  in  the  equations  containing  fi,  which  now  is  given  by 
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This  changes  (5.2)  to 


£,(•./) 


r  «ix(.)- 
e.r-.j) 


Except  for  the  removal  of  the  input  u^,  the  remaining  equations  in  the  section  do  not 
change. 

For  the  above  model,  Liang  looks  at  eight  different  combinations  of  initial  con¬ 
ditions  and  noise  variances.  We  will  consider  four  of  those  cases,  the  ones  for  which  the 
performance  of  the  MVF  and  the  EKF  differ.  For  all  cases,  the  true  parameter  value 
Z(i]  is  1.0,  the  initial  value  of  the  state  is  chosen  randomly  from  a  zero-mean  Gaus¬ 
sian  distribution,  and  the  initial  variances  for  the  state  and  parameter  are  1.0.  The  ini¬ 
tial  density  for  the  Bayes'  algorithm  is  bivariate  Gaussian,  centered  at  the  initial  esti¬ 
mate.  The  results  for  each  case  are  bcised  on  the  average  of  25  independent  Monte 
Carlo  runs  of  the  first  30  time  steps  (one  observation  per  time  step). 

The  first  two  cases,  shown  in  Figures  5.5  and  5.6,  are  for  small  noise  variances, 
Q =72 -0.01.  Figure  5.5  shows  the  results  for  an  initial  parameter  estimate  of  2.0.  For 
this  case,  Liang  showed  that  the  EKF  tended  to  diverge,  while  the  MVF  converged  to 
the  true  parameter  value  after  a  transient  phase.  The  results  here  show  the  MVF  con¬ 
verging  very  slowly,  if  at  all.  The  Bayes’  estimate,  though,  quickly  converges  to  the 
true  parameter  value.  The  Bayes  estimate  of  the  state  is  also  markedly  superior  to  the 
MVF  estimate.  Clearly,  there  is  a  wide  margin  between  the  performance  of  the  subop- 
timal  filter  and  the  true  optimal.  Figure  5.6  shows  the  result  of  making  the  initial 
parameter  estimate  0.1.  Liang  actually  used  an  initial  parameter  estimate  of  0  for  this 
second  case,  which  causes  the  EKF' to  not  respond  at  all,  and  adversely  affects  the  tran¬ 
sient  response  of  the  MVF.  Using  0.1  for  the  initial  parameter  estimate  seems  more  rea¬ 
sonable,  and  can  only  improve  the  MVF  response.  Liang’s  results  show  the  MVF  con¬ 
verging  on  average  to  a  parameter  estimate  of  approximately  0.6.  Figure  5.6  shows 
essentially  that  behavior  for  the  MVF,  perhaps  improved  slightly  since  it  appears  to  be 
moving  slowly  toward  the  true  value.  The  Bayes’  estimate,  on  the  other  hand,  con¬ 
verges  quickly  to  near  the  true  value.  Note  that  the  Bayes  estimate  jumps  from  the  ini¬ 
tial  estimate  to  near  the  true  value  after  the  first  observation.  This  rapid  convergence 
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Figure  5.5  -  Averaged  parameter/state  estimates  for  Liang's 
Case  II,  Q^R^O.Ol,  initial  parameter  estimate  2.0. 
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appears  to  be  typical  of  the  low  noise  cases,  as  w'ould  be  expected.  Even  though  the 
average  behavior  of  the  MVF  is  not  too  bad,  it  is  rather  inconsistent.  As  an  example, 
the  results  shown  in  Figure  5.7  turned  up  on  one  of  the  test  runs.  This  figure  shows  the 
fdter  performance  for  a  single  Monte  Carlo  run.  Here  the  MVF  actually  diverges,  while 
the  optimal  estimate  performs  essentially  as  the  average  shown  in  Figure  5.6. 
Throughout  the  individual  runs  that  were  reviewed,  the  Bayes’  estimate  performed  very 
consistently. 

In  addition  to  being  concerned  with  the  convergence  of  the  conditional  mean  to 
the  true  value,  we  are  interested  in  the  convergence  of  the  density  as  discussed  in  Sec¬ 
tion  5.3.  Even  if  the  mean  of  the  density  were  close  to  the  true  value,  we  would  have 
significantly  less  confidence  in  the  estimate  if  the  density  were  spread  out  than  if  it  were 
a  sharp  spike.  For  an  example,  consider  the  model  with  *he  noise  levels  and  initial  con¬ 
ditions  used  for  Figure  5.5.  Figure  5.8  shows  the  conditional  density  at  time  steps  1,  4, 
and  8  for  a  single  realization.  As  we  would  expect  for  a  low  noise  case,  there  is  very 
rapid  convergence  in  the  state  after  a  single  measurement.  A  more  measurements  are 
processed,  the  density  converges  in  the  parameter  direction  also.  In  general,  the  density 
is  clearly  converging  as  we  would  like  it  to.  This  behavior  was  also  seen  consistently  in 
the  other  individual  runs  that  were  reviewed. 

The  next  two  cases  show  the  results  of  larger  noise  variances,  Q=R=1.0,  with 
the  two  initial  conditions  used  above.  Figure  5.9  results  from  the  initial  parameter  esti¬ 
mate  2.0.  For  this  case,  Liang  shows  both  the  MVF  and  the  EKF  converging  fairly 
rapidly  to  the  true  parameter  value,  with  the  EKF  slightly  superior.  Here  the  MVF  and 
the  Bayes’  estimates  show  comparable  performance,  although  the  Bayes’  estimate  tends 
to  display  somewhat  smoother  convergence.  The  fairly  large  oscillations  in  the  MVF 
estimates  (particularly  the  state  estimate)  seen  in  this  figure  appear  commonly  in  the 
simulations.  Figure  5.10  shows  the  result  fur  an  initial  estimate  of  0.1.  Liang’s  results 
for  this  case  show  the  MVF  converging  to  an  estimate  of  about  0.5,  then  drifting  very 
slowly  toward  the  true  value.  The  results  here  show  that  the  Bayes’  estimate  converges 
more  quickly  and  closer  to  the  true  value  than  the  MVF.  Again,  note  that  the  Bayes’ 
estimate  is  smoother  than  the  MVF. 

These  results  allow  us  to  generalize  Liang’s  conclusions  as  follows. 
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Figure  5.9  -  Averaged  parameter /slate  estimates  for  Liang’s 
Case  li,  Q  =  R-1.0,  initial  parameter  estimate  2.0 
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1.  When  the  level  of  noise  inputs  is  large  enough  to  mask  the  effects  of  nonlineari¬ 
ties,  point  estimators  such  as  the  MVF  and  EKF  tend  to  perform  fairly  close  to 
the  optimum,  although  they  may  be  quite  sensitive  to  initial  conditions. 

2.  Although  the  MVF  is  claimed  to  be  ‘far  superior’  to  the  EKF  when  the  noises 

are  small  (so  that  the  effects  of  the  nunlinearities  are  not  negligible),  it  is  still 

far  from  the  optimal.  This  indicates  that  the  MVF  still  fails  to  capture  impor¬ 
tant  features  of  the  posterior  densities. 

As  a  final  point  in  this  section,  let’s  look  at  the  effect  of  including  a  known 
input  sequence.  As  noted  above,  Liang  did  not  include  this  factor  in  his  analysis.  It  is 
well-known,  however,  that  the  input  sequence  can  have  a  large  impact  on  the  perfor¬ 
mance  of  the  estimator.  Hence,  we  would  expect  based  on  theoretical  considerations 
that  including  it  would  improve  the  performance  of  the  estimators.  Figure  5.11  shows  a 
simulation  result  for  the  second  case  considered  above,  without  an  input  sequence.  Fig¬ 
ure  5.12  shows  the  result  with  a  zero-mean  Gaussian  sequence  with  variance  0.01,  so 
that  the  input  signal  to  noise  ratio  is  1.  The  same  noise  sequence  realizations  were  used 

in  both  simulations,  so  the  differences  are  due  solely  to  the  inclusion  of  the  known 

input.  The  improvement  in  the  performance  of  both  estimators  is  quite  remarkable. 
The  MVF  estimate  goes  from  essentially  nonconvergent  to  convergent.  The  Bayes’  esti¬ 
mate  converges  much  faster  and  much  closer  to  the  true  value.  Both  state  estimates 
show  the  same  type  of  improvement.  Of  course,  this  is  not  the  optimal  input  sequence, 
so  it  may  be  possible  to  get  even  more  improvement.  Even  so,  this  is  a  good  example  of 
the  performance  improvement  possible  when  an  input  sequence  is  available. 


Figure  5.12  -  Parameter/state  estimates  for  Liang’s  Case  II, 
with  known  input  sequence,  signal  to  noise  ratio  1.0. 


6.  APPLICATION  TO  BEARINGS-ONLY  TRACKING 


6.1.  Problem  Dermition 

Lets  turn  now  to  a  more  specific  problem.  The  bearings-only  tracking  problem 
is  a  very  important  practical  problem  that  arises  in  a  number  of  situations.  The  proto¬ 
typical  situation  is  a  sea-based  sonar  tracking  problem,  where  a  moving  ship  (the 
receiver)  is  listening  to  the  acoustic  output  of  another  moving  ship  (the  target  or 
source).  The  objective  is  to  estimate  the  speed  and  location  of  the  source  based  on  the 
noisy  measurements  of  the  bearing  of  the  sound  source  relative  to  the  receiver.  The 
geometry  of  this  problem  is  shown  in  Figure  6.].  The  source  is  located  at  the  point 
''^••■h  velocity  V,.  The  receiver  is  located  at  P,,  and  has  velocity  V,.  The 
angle  0  is  the  bearing  of  the  source  relative  to  the  receiver. 


It  is  important  to  point  out  that  the  system  is  only  observable  when  the  target 
has  constant  velocity  and  the  receiver  maneuvers  satisfy  certain  constraints  [41].  In 
particular,  the  system  is  not  observable  if  the  receiver  follows  a  constant  velocity  path. 
This  makes  at  least  some  sort  of  receiver  maneuver  mandatory  if  we  wish  to  completely 
characterize  the  target  position  and  velocity. 
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There  are  two  other  particular  points  to  made  about  this  problem.  First,  the 
problem  is  intrinsically  nonlinear,  and  hence  is  noi  addressable  by  conventional  linear 
estimation  techniques.  Second,  in  realistic  situations,  there  are  relatively  few  measure¬ 
ments  available  due  to  the  usually  short  contact  times.  Hence  the  transient  perfor¬ 
mance  of  an  estimator  is  of  critical  importance.  For  these  reasons,  we  would  expect  the 
Bayesian  approach  to  be  desirable. 

A  number  of  different  point  estimation  approaches  have  been  tried.  The 
extended  Kalman  filter  is  the  first  obvious  choice.  It  has  reasonable  performance  in 
many  cases,  but  is  prone  to  premature  covariance  collapse  and  solution  divergence,  due 
the  interaction  and  feedback  of  filter  errors  l42j.  Pseudo-linear  formulations  have  been 
suggested  to  avoid  this  difficulty,  but  the  resulting  filters  produce  biased  estimates  when 
used  with  noisy  data  |42|.  Petridis  has  developed  a  different  approach  based  on  a  parti¬ 
tion  algorithm  l43|.  This  algorithm  seems  to  avoid  the  EKF’s  divergence  problems,  but 
since  it  consists  of  a  number  of  filters  operating  in  parallel,  it  incurs  a  large  computa¬ 
tional  cost.  Yet  another  approach  is  to  work  in  other  than  the  ‘natural’  cartesian  coor¬ 
dinates.  Aidala  and  Harnmel  |44j  reformulated  the  problem  in  terms  of  modified  polar 
(MP)  coordinates  and  then  applied  the  EKF.  This  resulted  in  a  decoupling  of  certain 
terms  in  the  covariance  update,  so  that  covariance  collapse  would  not  occur.  The  prob¬ 
lem  h^ls  also  been  approached  from  the  Bayesian  viewpoint,  by  calculating  a  representa¬ 
tion  of  the  posterior  density.  Bucy  (45|  used  the  point  mass  approximation  to  look  at 
the  density  propagation  for  an  aircraft  based  problem.  Sorenson  |23]  used  his  p-vector 
approach  in  exploring  a  ship  based  problem,  and  used  the  peak  of  the  approximate  pos¬ 
terior  density  to  define  the  maximum  a  posteriori  estimator. 

In  this  section  we  will  apply  the  algorithm  developed  earlier  to  the  bearings- 
only  tracking  problem.  First,  we  will  analyze  the  density  propagation  from  a  qualitative 
point  of  view  first  to  try  to  get  some  additional  insight  into  the  problem.  Then  we  will 
compare  the  performance  of  the  conditional  mean  of  the  estimated  density  to  the  MP 
filter  for  several  example  scenarios  culled  from  the  literature. 

6.2.  Algorithm  Development 

The  first  step  is  to  define  the  system  equations  which  we  will  use.  As  noted 
above,  there  are  several  possible  formulations,  each  leading  to  different  equations.  The 
choice  of  reference  frame  is  extremely  important  in  designing  point  estimators  and  has 
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been  demonstrated  to  greatly  affect  the  ultimate  Filter  performance.  This  is  because  the 
errors  from  the  approximations  used  in  the  filters  depends  heavily  on  the  particular 
form  of  the  system  equations.  These  choices  are  rather  less  important  for  the  Bayesian 
formulation,  since  there  are  no  approximations  being  made.  It  is  of  some  importance  in 
applying  the  algorithm  of  this  paper,  since  the  choice  of  state  will  affect  the  computa¬ 
tional  complexity  of  the  algorithm. 

The  most  commonly  used,  and  conceptually  simplest,  form  is  bcised  on  cartesian 
coordinates.  The  main  advantages  of  this  formulation  are  that  it  results  in  linear 
dynamics  and  that  it  is  easy  to  visualize.  The  state  vecto.  !  defined  as  the  position  and 
velocity  of  the  source  at  time  step  k,  so' 


where  6  is  the  time  between  steps.  Note  the  velocity  does  not  depend  on  time,  since  we 
are  assuming  a  constant  velocity  source.  The  system  equations  are  then 
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z,  =  tan 
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h*  (**.»?*) 


where  (p,  ,(l:6),p,^(t6))  is  the  position  of  the  receiver  at  time  step  ifc,  that  is,  1=46.  The 
observation  noise  ri,  is  assumed  to  be  zero-mean  white  Gaussian  noise  with  variance  R. 
Note  that  there  is  no  noise  term  in  the  system  dynamics.  This  is  for  consistency  with 
previous  work,  and  is  not  a  necessary  assumption  for  the  Bayesian  approach.  Because 
there  is  no  plant  input  noise,  the  above  can  be  rewritten  as  an  equivalent  identification 
problem  in  terms  of  the  initial  position  of  the  source.  This  results  in  the  alternative  for¬ 
mulation 


*  '  ‘  |p.,.|0)  P,,»(0)  V,  ,  v,.,  I*" 

+  V*  =  ',17* 


1.  I  apologixe  for  the  clash  between  the  convention  o(  using  z  as  the  state  variable,  and  that  of  us¬ 
ing  X  as  the  horisontal  (east)  coordinate.  The  context  should  make  it  clear  which  meaning  is  in¬ 
tended. 


,  z'|i)-t46z  '(s)-p,.,(46) 
i  ',j, -1-1:6*  ',*,-p,_,(46) 
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where  p,  ,,  p,  ,,  and  v*  as  above.  This  formulation  has  even  simpler  dynamics  than 
the  one  above. 

it  is  also  possible  to  reformulate  the  problem  so  that  the  model  has  nonlinear 
dynamics,  but  is  linear  iti  the  observation.  This  is  the  approach  used  by  Aidala  and 
Harnmel  for  their  MP  filter.  In  terms  of  the  algorithm  in  this  paper,  however,  this 
tradeoff  of  observation  nonlinearity  for  system  nonlinearity  is  not  appealing.  Nonlinear¬ 
ities  in  the  system  dynamics  can  make  the  inverse  image  set  very  unwieldy,  and  make 
the  calculation  of  equation  (3.14b)  considerably  more  difficult.  Unless  the  change  also 
results  in  a  corresponding  reduction  in  the  complexity  of  the  integrand,  it  would  not  be 
worthwhile.  In  this  case  it  does  not,  so  we  will  concentrate  on  the  cartesian  formula¬ 
tion.  Of  the  two  above  formulations,  the  second  has  the  simpler  dynamics  (basically 
none),  and  since  they  are  otherwise  equivalent,  we  will  use  it. 

Having  selected  the  model,  we  can  now  develop  the  equations  to  implement  the 
algorithm.  The  first  step  is  to  define  the  grid.  The  easiest  grid  to  use  is  based  on 
hyper-rectangles  defined  by 

*  {*(<)€ 

This  is  definitely  not  to  imply  that  this  is  the  optimum  choice  for  the  grid.  It  is  certain 
that  other  grid  definitions  could  conform  more  closely  to  the  density  shapes  and  hence 
increase  the  efficiency  of  the  algorithm  by  minimizing  the  number  of  grid  cells  with 
non-negligible  mass.  For  clarity  of  presentation,  however,  this  simple  definition  is  prob¬ 
ably  best. 

The  next  step  is  to  determine  the  inverse  image  set  defined  by  equation  (3.14a). 
Since  there  are  no  system  dynamics  in  this  case,  the  inverse  image  set  Hy  is  simply  ly 
itself.  This  makes  the  limits  for  the  integral  in  equation  (3.14b)  extremely  simple.  If 
the  grid  has  not  been  changed  from  the  last  time  step,  the  two  grid  systems  coincide 
exactly  and  (3.14b)  becomes 

If  we  wish  „o  improve  the  efficiency  of  the  algorithm,  we  can  make  the  grid  dynamic  so 
that  it  can  conform  to  the  current  density.  If  this  is  done,  the  intersection  will  have  to 
be  calculated  in  order  to  evaluate  (3.14b).  If  the  new  grid  is  also  based  on  hyper¬ 
rectangles,  then  the  intersection  of  two  cells  is  also  a  hyper-rectangle  with  easily 


calculated  edges.  Turning  now  to  the  integrand,  we  recall  that  M=p(z\  z),  so  that  for 
this  problem 
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^'ou  should  note  that  this  makes  the  integral  above  rather  complicated,  even  for  the 
simple  case  of  a  fixed  grid.  Evaluating  this  integral  is  the  primary  computational  bur¬ 
den  in  implementing  the  algorithm  for  this  problem,  but  little  effort  was  put  into  optim¬ 
izing  the  coding  for  its  evaluation. 


As  pointed  out  earlier,  there  is  no  system  input  noise  for  this  problem.  This 
eliminates  the  need  for  equation  (3.14c).  Note  that  this  also  negates  one  of  the  benefits 
of  the  algorithm  presented  here;  since  the  convolution  does  not  enter  the  problem,  we 
get  no  benefit  from  being  able  to  use  the  FFT  technique  to  evaluate  it.  Despite  this,  it 
is  still  worthwhile  to  apply  this  algorithm  with  the  expectation  of  getting  a  reasonable 
approximation  to  the  actual  posterior  density. 


Finally,  equations  (3.14d)  and  (3.14e)  are  implemented  directly  to  normalize  the 
updated  density. 


6.3.  Qualitative  Density  Analysis 

Often  we  can  learn  a  great  deal  by  looking  qualitatively  at  the  density  behavior 
with  time,  motivated  by  an  enlightened  consideration  of  the  problem  from  the  Bayesian 
viewpoint.  Qualitative  analysis  is  largely  visualization  of  the  density,  and  imagining  the 
effects  in  different  situations.  Visualization  of  the  d>-nsity  was  fairly  easy  in  the  last 
chapter,  where  the  density  was  a  simple  surface  in  three-spate.  Visualization  for  this 
problem  is  much  harder,  since  most  people  aren’t  equipped  to  imagine  surfaces  in  five 
dimensional  space.  Two  possible  alternatives  to  help  with  the  visuaii^ation  come  to 
mind;  we  can  consider  ‘slices’  of  the  density,  or  look  at  the  marginal  densities.  For 
slices,  we  fix  two  of  the  four  stale  variables  and  look  at  the  resulting  surface  in  three- 
spacc.  For  the  marginal  densities,  we  usually  consider  the  joint  density  of  two  of  the 
states,  so  we  again  have  a  surface  in  three  spare.  The  problem  with  looking  at  the  mar¬ 
ginals  is  that  they  tend  to  obscure  the  fine  detail  of  the  density.  Marginals  are  like  pro¬ 
jections  of  the  density  onto  subspaces,  so  it  is  somewhat  like  trying  to  guess  the  shape 
of  an  object  from  its  shadows.  Dealing  with  only  slices,  though,  still  leaves  the  problem 
of  integrating  the  individual  visualizations  to  provide  an  understanding  of  the  entire 
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density.  Neither  approach  is  entirely  satisfactory,  so  both  may  be  necessary  to  describe 
the  density  behavior. 

We  begin  by  reviewing  the  measurement  update  function  I'he  peak  of  the 
function  (where  the  argument  of  the  exponential  is  zero)  lies  along  the  hyperplane 
defined  by 

X,,)  T  P,,,(fci) 

'  - n - rnr 

1(2,  +  *<>1(4)  Pr.,!**) 

and  contours  of  the  density  lie  along  the  related  hyperplanes 


ta.n(x  -t-e ) 
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1(2)  +  ii*(«)  -  P,.t(k6) 


where  e  determines  the  density  value  on  the  contour.  For  a  particular  velocity  (fixed 
Z(s)  and  Z(4)),  the  image  of  the  density  in  position  space  (x(i)XX(2))  is  &  ridge  which  gets 
wider  as  it  gets  further  away  from  the  origin.  The  peak  is  along  the  line 


where  C  depends  on  the  fixed  velocities,  the  measurement,  and  the  piosition  of  the 
receiver.  So  for  each  fixed  velocity  slice,  the  angle  of  the  ridge  with  respect  to  the  zjij 
axis  is  the  same,  but  the  intercept  changes  from  slice  to  slice.  Roughly  speaking,  the 
value  of  /i  at  a  point  represents  the  likelihood  that  a  target  starting  from  that  point 
with  the  fixed  velocity  could  be  responsible  for  the  given  observation. 


With  this  in  mind,  lets  look  at  the  propagation  of  the  density  in  position  space 
for  a  fixed  velocity.  As  usual,  start  with  a  uniform  initial  density.  At  the  first  time 
step  (t=^0),  C=0,  so  the  fi  ridge  line  runs  through  (0,0)  along  the  first  observed  bearing 
line.  Since  the  prior  is  uniform,  the  result  of  the  measurement  update  is  a  duplicate  of 
fi.  Note  that  fi  will  run  along  only  the  positive  range  half  of  the  bearing  line.  For  the 
second  measurement,  C  is  nonzero  and  presumably  z  is  different  than  the  first,  so  the 
ridge  of  the  prior  and  the  ridge  of  n  intersect.  The  multiplication  of  the  measurement 
update  results  in  a  hump  at  the  intersection.  The  spread  will  depend  on  the  width  of 
the  ridges  at  the  intersection  and  on  the  angle  of  the  intersection.  The  location  of  the 
hump  identifies  the  appropriate  starting  point  for  a  target  with  the  given  velocity  to 
meet  the  observations.  With  the  third  measurement,  we  get  yet  another  C,  z,  and  ft. 
This  time,  the  n  ridge  may  or  may  not  align  with  the  peak  in  the  prior.  If  it  does  not, 
the  result  is  attenuation  of  the  peak.  This  indicates  that  fixed  velocity  is  not 
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compatible  with  the  observalioiis.  The  lota)  probability  associated  with  this  slice  will 
decline.  If  p  does  align  with  the  peak  in  the  prior,  the  peak  is  accentuated.  In  this 
way,  certain  velocity  slices  will  build  up  peaks  at  particular  locations.  The  height  of  the 
peak  and  its  spread  is  indicative  of  the  likelihood  that  the  velocity  associated  with  that 
slice  could  be  associated  with  the  observations.  The  location  of  the  peak  gives  the 
rec]uired  initial  location  for  that  velocity  to  have  given  those  measurements.  Taken 
together,  the  peaks  for  all  the  velocity  slices  form  a  line  in  the  state  space,  giving  a  set 
of  likely  initial  position/ velocity  pairs. 

When  the  receiver  maneuvers,  it  changes  the  progression  of  C  and  z.  If  we 
were  starting  from  a  uniform  prior,  this  would  build  up  a  different  line  through  the 
state  space.  Instead,  we  are  starting  with  the  density  resulting  from  all  the  observations 
before  the  maneuver.  In  some  velocity  slices,  the  peaks  from  the  before  and  after 
maneuver  observations  will  be  far  apart.  This  indicates  that  there  is  not  likely  that  a 
single  starting  point  could  explain  both  sets  of  observations.  In  those  slices,  both  peaks 
will  be  attenuated.  In  others,  the  peaks  will  be  close  together,  resulting  in  an  accen¬ 
tuated  peak.  In  this  way  the  most  likely  of  the  position/velocity  pairs  are  picked  out. 

In  terms  of  the  entire  density,  we  can  imagine  the  first  n  as  centered  on  a  verti¬ 
cal  plane.  Successive  fis  are  rotated  with  respect  to  the  first,  but  all  intersecting  along  a 
line.  The  receiver  maneuver  changes  the  rotation  so  that  the  intersection  of  succeeding 
fiH  is  along  a  different  line.  The  intersection  of  these  two  lines  determines  the  eventual 
peak  of  the  full  density.  Actually,  this  is  the  idealized  behavior.  The  presence  of  noise 
perturbs  the  observations  from  the  natural  progression,  so  that  the  hyperplanes  no 
longer  intersect  along  a  single  line.  For  some  noise  sequences,  this  can  mimic  the  effect 
of  a  small  receiver  maneuver  by  appearing  to  produce  two  distinct  intersecting  lines. 
This  can  result  in  apparent  convergence,  or  false  observability.  These  effects  usually 
average  out,  but  can  give  temporary  spurious  peaks  in  the  density,  particularly  before 
the  maneuver. 

The  behavior  in  fixed  position  slices  is  the  same.  For  a  fixed  position,  the  den¬ 
sity  in  velocity  space  converges  to  the  velocity  most  likely  to  have  produced  the  given 
observations  starting  from  the  fixed  position.  If  the  given  position  is  not  compatible 
with  the  measurements,  the  density  in  that  slice  flattens  out  and  goes  to  zero. 

It  is  now  fairly  easy  to  see  what  the  marginal  densities  will  look  like.  After  the 
first  measurement,  the  marginal  position  density  will  be  a  replica  of  the  first  ft,  since 
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each  fixed  velocity  slice  is  identical.  'I’he  marginal  velocity  density,  though,  will  still  be 
uniform.  As  measurements  are  processed,  the  marginal  densities  will  converge  to  ridges, 
with  peaks  along  the  projections  of  the  line  that  the  peak  of  the  entire  density  is  on. 
After  the  maneuver,  the  marginal  densities  will  begin  to  converge  to  peaks.  Note  that  it 
is  not  appropriate  to  imagine  the  marginal  density  as  the  product  of  the  before 
maneuver  and  after  maneuver  marginal  densities  as  we  did  with  the  entire  density. 

What  conclusions  can  we  draw  from  this  analysis?  Two  come  quickly  to  mind. 
First,  we  see  the  perhaps  obvious  relationship  between  distance  and  convergence.  Look¬ 
ing  again  at  a  fixed  velocity  slice,  we  note  that  the  spread  of  ^  in  the  vicinity  of  the 
p>eak  is  proportional  to  the  range  to  the  target  for  the  current  measurement  for  a  target 
with  that  initial  location  and  velocity.  Thus,  we  will  get  better  convergence  in  the  true 
velocity  slice  if  we  can  maneuver  to  keep  the  range  to  the  target  at  a  minimum  for  all 
the  measurements.  As  a  corollary,  this  says  that  convergence  will  be  better  for  a  target 
that  happens  to  be  closer,  than  one  which  is  far  away.  Second,  it  provides  us  a  poten¬ 
tial  criterion  for  picking  better  maneuvers.  Recall  the  second  measurement  update 
described  above.  We  noted  that  the  convergence  of  the  peak  in  the  fixed  velocity  slice 
depended  on  the  angle  of  intersection  of  the  two  ridge  lines.  If  the  two  are  nearly  paral¬ 
lel,  convergence  will  be  mostly  perpendicular  to  the  original  ridge  line.  If  they  are  at 
right  angles,  the  convergence  will  be  mostly  along  the  original  ridge  line.  This  suggests 
that  we  should  try  to  maneuver  so  as  to  get  a  wide  range  of  bearings,  to  ensure  conver¬ 
gence  in  all  directions.  The  other  effect  that  a  maneuver  has  is  in  shifting  the  mutual 
intersection  line  of  the  succeeding  (is.  For  best  convergence,  we  would  like  the  new  line 
to  be  skewed  away  from  the  old  line  as  much  as  possible.  Although  it  has  not  been 
explored  in  this  paper,  it  is  possible  that  strategies  could  be  developed  with  this  in 
mind. 


6.4.  Comparison  of  the  Conditional  Mean  to  a  Point  Estimator 

As  we  discussed  in  the  last  chapter,  one  use  of  the  Bayesian  approach  is  in 
describing  the  behavior  of  the  true  optimal  estimate  for  comparison  to  suboptimal  or 
approximate  estimators.  In  this  section  we  compare  the  mean  of  the  conditional  density 
to  Aidala  and  Hammel's  MP  filter.  As  in  the  last  chapter,  the  point  estimator  is 
intended  to  approximate  the  minimum  variance  estimate  of  the  state.  The  mean  of  the 
posterior  density,  on  the  other  hand,  is  the  minimum  variance  estimate. 
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The  first  issue  in  using  our  approach  is  how  many  points  to  use  in  the  grid. 
Some  preliminary  experiments  were  run  using  between  six  and  ten  points  on  eeich 
dimension.  All  the  runs  used  the  same  g€!ometry  and  random  number  seed  to  insure 
comparability.  The  runs  showed  continued  change  as  the  number  of  points  increased, 
although  the  size  of  the  change  decreased  and  was  fairly  small  between  the  nine  and  ten 
point  runs.  This  coupled  with  the  rapidly  increasing  computational  and  storage  require¬ 
ments  suggested  that  nine  or  ten  points  on  each  dimension  would  be  satisfactory.  Since 
we  are  not  using  a  particularly  fine  grid,  it  is  worth  emphasizing  that  this  section  is 
really  a  comparison  of  two  approximate  techniques.  The  approximate  density  is  a  close 
approximation  to  the  true  density,  but  it  is  not  an  exact  representation. 

In  this  comparison,  we  will  look  at  four  different  scenarios  (target/receiver 
geometries)  gathered  from  other  papers  on  the  bearings-only  problem.  For  each 
scenario  we  will  present  typical  individual  runs.  Each  run  covers  16  minutes,  with  one 
sample  per  minute.  The  receiver  always  starts  at  coordinates  (0,0)  and  proceeds  at  con¬ 
stant  velocity  for  the  first  8  time  steps.  Immediately  after  the  eighth  observation,  the 
receiver  executes  an  instantaneous  maneuver  to  a  new  velocity,  and  maintains  that  velo¬ 
city  until  the  end  of  the  run.  The  target  starts  at  a  specified  position,  and  proceeds  at 
constant  velocity  for  the  entire  run.  All  distances  are  in  meters,  and  speeds  in  meters 
per  minute.  The  first  plot  of  each  set  shows  the  target  and  receiver  tracks  for  the  entire 
run,  and  the  estimated  target  tracks  for  time  steps  nine  through  sixteen.  The  second 
figure  of  each  set  shows  the  estimator  errors  versus  time.  The  errors  are  plotted  as  per¬ 
centage  error  in  range  to  target  and  speed  of  target,  and  degree  error  in  bearing  to  tar¬ 
get  and  target  heading. 

Filter  initialization  is  always  a  difficulty.  The  MP  filter  was  initialized  in 
accord  with  the  recommendations  in  Aidala  and  Hammel’s  paper,  with  the  exception 
that  the  initial  range  estimate  was  adjusted  to  near  the  true  initial  range  instead  of 
being  fixed  at  10  km.  The  initial  density  for  the  Bayes’  algorithm  was  uniform  over  the 
initial  grid,  which  in  turn  was  chosen  to  include  the  true  position.  Aside  from  the 
requin  nient  that  the  initial  grid  must  contain  the  true  initial  conditions,  it  is  clear  that 
we  want  to  pick  a  small  grid  to  increase  the  resolution  in  the  main  region  of  the  density. 
What  may  not  be  clear  is  that  we  also  need  the  grid  to  be  large  enough  to  cover  all  the 
density,  so  that  the  mean  is  accurate.  With  an  adjustable  grid,  it  is  usually  better  to 
err  on  the  large  side  and  let  the  grid  adapt  to  the  density. 


The  first  two  examples  are  from  Aidala  and  Harnmel’s  paper  that  introduced 
the  MP  filter  |44|.  The  first  is  their  long  range  scenario.  The  target  is  initially  at 
P,(0)  =  (24688,0)  with  velocity  V',  (617,0).  The  receiver  begins  with  velocity  P, “(617,617) 
and  has  V',  ^(617,  617)  after  the  maneuver.  The  noise  variance  is  H=^\A  degrees 
squared.  The  initial  grid  for  the  Bayes’  algorithm  is  defined  by 

{j„)t  (15000,40000)  ;  z,j)6  (-2000,2000)  ;  (100,1000)  ;  (-400,400)} 

This  is  a  somewhat  unrealistic  example,  since  the  receiver  is  managing  to  follow  the  tar¬ 
get  at  same  e,  velocity.  Figure  6.2  shows  the  tracks  for  this  example.  The  two  esti¬ 
mates  show  similar  performance.  Turning  to  the  error  plots  in  Figure  6.3,  we  see  that 
the  Bayes’  mean  has  somewhat  better  range  and  speed  estimates,  and  essentially  equal 
bearing  and  heading  errors.  This  is  not  surprising,  since  the  bearings  to  the  target  a 
fairly  small  (note  the  different  x  and  y  scales)  making  the  linearization  of  the  EKF  more 
accurate. 

The  second  example  is  Aidala  and  Hainmel’s  short  range  scenario.  It  is  the 
same  as  above  except  that  the  target  is  initially  at  P,  (0)  =  (2468,0),  and  the  initial  Bayes 
grid  is  defined  by 

{z(,,e  (1500,8000)  ;  (-300,300)  ;  z,,,€  (100,1000)  ;  (-800,300)} 

We  have  two  examples  for  this  scenario,  differing  only  in  the  specific  noise  sequences 
used.  In  the  first  (Figures  6.4  and  6.5),  the  MP  filter  manages  to  maintain  a  good  esti¬ 
mate  of  the  bearing  to  the  target,  but  otherwise  does  quite  poorly.  The  Bayes’  mean  on 
the  other  hand  does  very  well.  The  data  for  time  step  eight  (not  graphed),  shows  that 
both  estimators  had  comparable  estimates  at  that  time,  with  the  MP  filter  having  a 
somewhat  larger  range  estimate.  After  the  ninth  measurement,  both  reduced  the 
estimated  range,  but  the  MP  filter  overcompensated  and  could  not  recover.  In  the 
second  example  (Figures  6.6  and  6.7),  we  see  a  more  extreme  case.  Here,  the  MP  filter 
cannot  even  maintain  lock  on  the  target  bearing.  The  Bayes’  mean  begins  with  a  worse 
e.stimate  than  the  previous  case,  but  converges  anyway. 

The  behavior  of  the  Bayes’  mean  is  clearer  if  we  look  at  the  approximate  den¬ 
sity.  Figure  6.8  shows  the  marginal  initial  position  density  at  time  steps  eight,  ten,  and 
twelve  for  the  second  of  the  above  cases.  At  time  step  eight,  the  density  has  converged 
along  the  original  bearing  line,  but  not  in  range.  According  to  our  discussion  in  the  pre¬ 
vious  section,  though,  the  height  of  the  ridge  should  be  nearly  constant,  but  it  is  not. 
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Figure  6.3  -  Comparison  of  estimator  errors  for  Aidala  and 
HammePs  long  range  scenario. 
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This  is  a  result  of  the  algorilhrn  averaging  fi  over  fixed  size  cells.  Recall  that  in  a  fixed 
velocity  slice  is  a  wedge  shaped  ridge  with  a  constant  height  along  the  ridge  line.  The 
average  over  a  cell  near  the  origin  on  the  ridge  line  will  be  small,  since  rolls  off 
quickly.  On  the  other  hand,  the  average  over  a  cell  away  from  the  origin  will  be  larger, 
since  fi  remains  near  its  maximum  over  a  larger  area.  This  effect  is  predicted  by  the 
earlier  error  analysis,  i.e.,  the  error  in  the  approximation  increases  when  the  density  is 
sharply  peaked  in  relation  to  the  grid  size.  Also  note  that  the  peak  of  the  density  is  at 
the  extreme  edge  of  the  grid,  implying  that  the  grid  does  not  cover  the  entire  density. 
This  is  to  be  expected,  since  the  actual  density  is  infinite  in  the  range  direction  at  this 
point.  The  loss  of  mass  is  not  a  problem  in  this  application,  since  we  are  not  performing 
the  convolution  step  (remember  we  have  no  plant  noise).  Hence,  the  prediction  density 
has  only  local  dependencies  and  is  not  affected  by  the  loss  of  mass.  Following  the 
maneuver,  however,  the  density  begins  to  converge  in  range,  and  by  time  step  ten  we 
see  marked  convergence.  Note  that  the  grid  almost  entirely  covers  the  density.  There 
appears  to  be  some  mass  left  off  at  the  far  edge,  so  the  calculated  mean  range  is  prob¬ 
ably  a  bit  on  the  short  side.  By  time  step  twelve,  the  density  has  converged  enough 
that  the  grid  effectively  covers  it  entirely,  and  it  exhibits  a  strong  central  peak.  There 
is  also  a  noticeable  secondary  peak,  but  it  is  gone  by  the  fourteenth  time  step. 

The  next  scenario  comes  from  Sorenson’s  p-vector  paper  |23).  The  target  is  ini¬ 
tially  at  P,  (0)  =  (2164,-190)  with  velocity  F,  =  (-280,310).  The  receiver  begins  with  velo¬ 
city  V,  =  (63,94)  and  turns  to  V',  =  (-94,63)  after  the  maneuver.  The  noise  variance  is 
P=0.5  degrees  squared.  The  initial  grid  for  the  Bayes’  algorithm  is  defined  by 

(1500,3200)  :  Z(,)t  (-800,800)  ;  Z(,,e  (-600,100)  ;  Z(«)6  (-100,600)} 

Figures  6.9  and  6.10  display  the  results  for  this  case.  The  MP  filter  gets  a  good  esti¬ 
mate  of  the  true  bearing  to  the  target,  but  does  very  poorly  in  all  other  regards.  The 
Bayes’  mean  does  much  better,  but  still  not  terribly  well.  Figure  6.11  shows  the  density 
for  ihi.s  case  at  time  steps  nine  and  sixteen.  The  density  at  step  nine  (first  measure¬ 
ment  after  the  maneuver),  shows  some  convergence  in  range,  but  the  very  little  over  the 
following  measurements.  This  is  due  in  part  to  the  fact  that  the  point  on  closest 
approach  occurs  at  about  the  sixth  time  step,  well  before  the  maneuver.  Hence,  the 
spread  of  the  measurement  density  near  the  true  range  is  getting  wider  with  each  meas¬ 
urement,  severely  curtailing  the  possible  convergence. 


Bearings-Only  Estimation  -  Vehicle  Tracks 
Sorenson  P— Vector  Example 


Figure  6.9  -  Vehicle  tracks  for  Sorenson’s  p-vector  example. 


Sorenson  P— Vector  Example 
Marginal  Initial  Position  Density  -  k=9 


Sorenson  P-Vector  Example 
Marginal  Initial  Position  Density  -  k=16 


Figure  6.11  -  Marginal  initial  position  densities  for 
Sorenson’s  p-vector  example,  time  steps  nine  and  sixteen. 
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The  last  example  is  from  a  paper  by  Aidala  on  Kalman  filler  behavior  [42).  The 
target  is  initially  at  P,(0)^(4383,-383)  with  velocity  (-482,575).  The  receiver  begins 
with  velocity  I'',  =  (98,169)  and  turns  to  ^,=  1  177,82)  after  the  maneuver.  The  noise 
variance  is  P  9.0  degrees  squared.  The  initial  grid  for  the  IJayes’  algorithm  is  defined 
by 

{i|,|C  (3000,6000)  ,  (  1000,100)  ;  z,5|L  (-600,100)  ;  i,4,t  (0,800)} 

This  is  a  particularly  difficult  scenario  due  to  the  long  range  and  high  observation  noise 
level.  As  can  be  seen  in  Figures  6.12  and  6.13,  the  MP  estimate  diverges  for  this  case. 
At  time  step  eight  (not  graphed),  immediately  before  the  maneuver,  the  MP  filter  esti¬ 
mate  is  fairly  close  to  the  true  value,  though  not  eis  close  as  the  Bayes’  mean.  The  first 
measurement  after  the  maneuver  causes  the  MP  range  estimate  to  get  very  large.  The 
next  measurement  induces  such  a  large  residual  that  the  range  estimate  goes  negative, 
and  the  MP  filter  then  locks  onto  this  ‘negative’  track,  giving  a  bearing  error  of  roughly 
180°.  To  check  whether  this  behavior  was  due  to  the  particular  noise  sequence,  the 
MP  filter  was  rerun  ten  times  using  different  random  number  sequences.  The  same 
behavior  was  exhibited  for  all  runs,  indicating  that  it  is  a  result  of  the  geometry  or  the 
initialization  of  the  MP  filter.  The  Bayes’  mean,  on  the  other  hand,  does  well,  although 
not  as  well  as  the  graph  of  the  tracks  would  seem  to  indicate  due  to  scale  of  plot. 

These  results  confirm  the  advantages  of  using  the  conditional  mean  of  the 
approximate  density  as  an  estimator.  The  consistently  good  performance  of  the  mean  in 
turn  confirms  a  reasonable  degree  of  accuracy  in  the  approximate  density  generated  by 
the  algorithm.  These  results  also  indicate  that  the  MP  filter  does  not  live  up  to  theoret¬ 
ical  limits,  since  it  does  not  even  match  the  mean  of  the  approximate  density.  We  also 
see  that  the  MP  filter  has  extremely  poor  performance  for  at  least  some  geometries. 

The  good  results  from  the  approximate  Bayes’  mean  come  at  a  price,  however. 
The  algorithm  took  roughly  50  seconds  of  central  processing  time  on  a  CDC  CYBER 
computer  per  time  step  for  a  10x10x10x10  grid.  (Even  so,  this  is  roughly  real  time, 
since  most  sea-based  applications  have  one  minute  time  steps.)  The  most  time  consum¬ 
ing  portion  of  of  the  algorithm  is  the  integration  of  the  measurement  density.  In  fact, 
approximately  90%  of  the  program  run  time  is  spent  in  the  integration  routine.  Since 
no  vigorous  attempt  was  made  to  optimize  this  part  of  the  program,  it  is  reasonable  to 
expect  that  it  could  be  improved  considerably.  The  program  could  also  be  speeded  up 
by  taking  advantage  of  the  considerable  parallelism  in  the  algorithm.  Even  if  the 
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7.  SUMMARY  AND  CONCLUSIONS 

This  dissertation  lias  presented  the  derivation  and  application  of  a  new  algo¬ 
rithm  for  recursively  generating  an  approximation  of  the  conditional  state  density  for  a 
nonlinear,  discrele-tiine  system. 

The  algorithm  is  derived  by  taking  the  formal  recursive  equations  for  the  condi¬ 
tional  density  and  consistently  using  piece-wise  constant  approximations  for  the  various 
constituent  densities.  The  resulting  algorithm  is  given  by  equation  set  (3.14).  As  an 
interesting  aside,  the  algorithm  can  also  be  derived  by  considering  the  recursive  calcula¬ 
tion  of  the  probability  mass  associated  with  each  grid  cell.  In  other  words,  the  algo¬ 
rithm  can  also  be  thought  of  as  a  probability  mass  filter. 

There  are  two  main  advantages  to  this  algorithm.  First,  the  time  update 
becomes  a  discrete  linear  convolution,  instead  of  a  nonlinear  convolution  as  in  other 
algorithms.  This  allows  the  convolution  to  be  evaluated  using  FFT  techniques,  costing 
0(3N/b(N)  +  N)  operations  instead  of  t>(N*)  operations,  where  N  is  the  number  of  grid 
points.  Second,  the  algorithm  allows  a  thorough  analysis  of  the  propagation  of  the  error 
in  the  approximate  density.  The  error  analysis  shows  that  the  increase  in  the  error  at 
each  iteration  is  bounded,  although  the  bound  is  not  very  small.  More  importantly,  the 
analysis  yields  a  characterization  of  the  situations  where  large  error  growth  could  be 
expected.  This  characterization  indicates  that  we  can  expect  the  error  growth  to  be  far 
less  than  the  bound  in  most  cases  of  interest.  The  expected  error  stability  was  demon¬ 
strated  by  extensive  comparisons  of  the  approximate  density  to  the  true  density  for  a 
linear  time-invariant  system  with  Gaussian  noises.  The  results  demonstrate  the  stabil¬ 
ity  and  accuracy  of  the  algorithm  over  a  wide  range  of  conditions. 

For  the  next  two  applications,  the  algorithm  was  considered  as  a  tool  in  imple¬ 
menting  a  Bayesian  approach.  Two  common  but  problematic  nonlinear  systems  were 
considered.  The  first  was  the  case  of  a  scalar  linear  system  with  an  unknown  parame¬ 
ter.  The  objective  was  to  estimate  both  the  state  and  the  parameter  simultaneously. 
'I'he  second  case  was  that  of  a  moving  receiver  taking  noisy  bearing  measurements  of  a 
moving  target,  and  trying  to  estimate  the  target’s  position.  In  both  cases  the  specific 
equations  for  implementing  the  algorithm  were  derived,  and  used  to  augment  a  qualita¬ 
tive  analysis  of  the  density  behavior.  The  algorithm  was  then  used  to  calculate  an 
approximation  to  the  conditional  density,  which  in  turn  yielded  the  conditional  mean. 
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which  is  the  iniiumurii  variance  estimator.  This  approximation  to  the  optimal  estimate 
was  then  compared  to  appropriate  point  estimators  for  each  case,  showing  the  superior¬ 
ity  of  the  mean  of  the  approximate  density. 

We  obtain  several  benefits  from  tliis  combination  of  the  Bayesian  approach  and 
this  new  algorithm.  First,  one  can  develop  considerable  insight  into  system  behavior  by 
considering  tfie  problem  in  terms  of  the  probability  densities.  This  provides  a  unique 
perspective  not  available  otherwise.  The  algorithm  augments  this  by  providing  a  means 
of  calculating  the  densities  so  that  their  behavior  can  be  examined.  Second,  the  combi¬ 
nation  provides  a  benchmark  for  all  other  estimation  schemes.  Since  the  actual  density 
function  is  available,  the  true  optimum  estimate  for  any  given  loss  function  can  be  cal¬ 
culated.  Thus,  the  performance  of  any  other  estimator  can  be  compared  to  the  best 
possible  performance.  Third,  we  could  consider  actually  using  this  combination  to  pro¬ 
vide  an  optimal  estimator  for  online  use.  This  would  be  particularly  useful  in  cases 
where  the  transient  performance  is  critical.  The  algorithm  is  currently  fairly  expensive 
computationally,  but  no  attempt  had  been  made  to  take  advantage  of  its  parallelism. 
Customized  computer  architectures  would  undoubtedly  provide  immense  increases  in 
speed. 

In  a  broad  sense,  this  dissertation  shows  the  advantages  of  taking  the  Bayesian 
view,  and  provides  an  improved  tool  for  implementing  a  Bayesian  approach.  Taking  a 
Bayesian  view  can  lead  to  a  deeper  understanding  of  the  behavior  of  the  system,  and 
hence  to  a  better  appreciation  of  the  potentials  and  limitations  of  approximate  estima¬ 
tors.  At  best,  the  Bayesian  approach  can  actually  provide  the  optimal  estimate  for  a 
given  loss  function.  At  worst,  it  provides  a  valuable  benchmark  for  the  performance  of 
other  estimators. 
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